TRANSCRIPT
Agenda
NIAM – NASA Information Architecture and Data Management
NASA New Roles with Big Data
Big Data/Big Think
Big Data Tools & Technologies
NASA Examples
NESC Data
» Frangible Joint Data
» Business Data
2
4/11/2017
NIAM
NIAM – NASA Information Architecture and Data Management
» Establish effective information standards, digital strategies, and data management
across the Agency
» This OCIO Program is composed of five primary elements: Architecture and
Standards, Management and Governance, Policy and Legal, User and Data
Services, and Sustainment and Outreach
» The intent of the Information Architecture and Data Management Program is to
provide capabilities that enable the Agency to efficiently acquire or generate,
find and access, use and reuse, share and exchange, manage and govern, and
store and retire its data. The Agency Information Architecture defines a
consistent information lifecycle context that overlays all Agency activity,
whether it is engineering, science, operations, or institutional support.
Tech & Innovation Division – What We Do (NIAM.NASA.Gov):
» Data Science: The collection, management, and analysis of data in order to produce information and drive decision making.
• Data Strategy, Data Governance, Data Lifecycle Management, Data Analytics Lab, Data Fellows Program, Data Stewards
» Open Innovation: The development of new innovation frameworks and techniques, and the development and delivery of machine-readable instructions to access, arrange, and apply data.
• Agency Open Data Mgmt, Digital Strategy Reporting, Space Apps Challenge, Innovation Incubator, Data Innovation Pipeline, Women in Data
• Open.NASA, Github/NASA, Data.NASA, Code.NASA, API.NASA
» Digital Integration: The delivery of enhanced digital capability to support data interoperability and accessibility to enable data insights and discovery.
• Data inventory/Registry, Tagging/discoverability, Data usability/APIs, Computing/Coding, Mission-focused tech applications
» Tech Infusion: Research, prototype, and assess the creation, modification, and usage of processes or tools to solve a problem and meet stakeholder needs.
• Data Centric Architecture, Internet of Things, Software as a Service, Virtual Desktop Infrastructure, Collaboration (Secure Fileshare, NASATube)
» Information Architecture: The development and use of models, policies, rules, or standards that govern how information is collected, stored, arranged, integrated, and put to use in data systems and in organizations.
• Standards and interoperability, IA Reference Architecture, Data Mining, Data Contract Language, Mission Support Tech Consulting
» Big Data
NASA New Roles with Big Data
Data Scientist: NASA has several, including one in the Office of the Chief
Information Officer
Big Data Administration
Chief Data Officer/Chief Knowledge Officer: Becoming very popular in
Federal Government and in Industry
Big Data Stewards
Data Evangelists
Labs:
» Data Analytics Lab
» Internet of Things Lab
More on Big Data Working Group
Over 60 subject matter experts from the NASA Centers and Mission
Directorates
Heavy focus on sharing of latest projects
» techniques/tools used
» avoid duplication
» share licenses when appropriate
» new applications to problems
Best practices, upcoming symposiums, conferences, meetings
Tool demos, commercial or homegrown
Big Data History
Late 1980s, IBM researchers Barry Devlin and Paul Murphy developed the
“business data warehouse,” while Bill Inmon wrote Building the Data
Warehouse in 1992 (wiki.com)
1997, The term “Big Data” was coined by two Ames Research Center
employees, Michael Cox and David Ellsworth, because the data they were
processing was too big to process on their desktop computers
July 1999, Predictive Analysis changes business as usual (Winshuttle.com)
February 2001, Doug Laney, an analyst with the Meta Group, publishes a
research note titled “3D Data Management: Controlling Data Volume, Velocity,
and Variety.” This becomes known as the “3Vs” (Forbes.com)
2006, Hadoop is created to handle explosion of web data (Gigaom.com)
June 2008, The term “Big Data” began gaining popularity, predicted to “create
revolutionary breakthroughs” (Cra.org)
What Is Big Data?
Definition
Big Data means many things rolled into one phrase
• No single standard definition…
• Some think it’s data too big to fit in Excel!
• Some think it’s data too large to fit on one node
Gartner has defined Big Data as high-volume, high-velocity, high-variety
information assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making
Well-known definition: Big Data is data whose scale, diversity, and complexity
require new architectures, techniques, algorithms, and analytics to manage it
and extract value and knowledge from it
What is Big Data?
5 V’s
Volume
• Huge data size
• Terabytes to Zettabytes of data
• Data coming from records, archives, transactions, tables and files
Velocity
• Streaming data, high speed of data flow
• The speed at which data is generated, processed, and moved in real or near-real time
Variety
• Various data sources
• Data in many formats: structured, unstructured, text, multimedia
Veracity
• Trustworthiness, authenticity, reputation, accountability of the source and accuracy
of the data
Value
• The most important function of Big Data is turning data into information
What is Big Data?
to NASA
Big Data describes the situation in which collection, storage, and analysis
of data exceed the capability and capacity of conventional processing,
storage, and analysis methods
Necessitates new architectural approaches
NASA views Big Data challenges from the point of view of the data life cycle
• Generation, Transport, Stewardship, Processing, Analysis, Access, Dissemination
Requirements for new technologies lie in areas such as:
• Data management
• Artificial intelligence
• Instrumentation
• Scalable cyber infrastructure
• Data storage and access methods
• Computational and collaboration systems
• Visualization
• Analytics algorithms
Why Big Data?
What People Do With It
The predominant use of Big
Data analytics is to find
relationships among various
unconnected datasets and
databases
» Finding patterns, “virtual schemas,”
which may be temporal/transient –
primarily from web logs
» As opposed to relational databases
(RDBMSs), which have pre-defined
relationships among all their data through
the use of a schema
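The contrast above can be sketched in plain Python: two hypothetical datasets (web logs and help-desk tickets) that were never designed to relate are grouped on a field they happen to share, forming the kind of transient “virtual schema” described above rather than a pre-defined relational one. All data and field names here are illustrative.

```python
# Sketch: correlating two independently collected datasets at analysis
# time, with no schema or foreign key defined up front (made-up data).
from collections import defaultdict

web_logs = [
    {"ip": "10.0.0.1", "page": "/telemetry", "ts": "2017-04-11T09:00"},
    {"ip": "10.0.0.2", "page": "/docs",      "ts": "2017-04-11T09:05"},
    {"ip": "10.0.0.1", "page": "/downloads", "ts": "2017-04-11T09:07"},
]
helpdesk_tickets = [
    {"ip": "10.0.0.1", "issue": "slow downloads"},
]

def correlate(logs, tickets, key="ip"):
    """Build a 'virtual schema' on the fly: group both datasets by a
    field they happen to share, then keep the keys that intersect."""
    by_key = defaultdict(lambda: {"pages": [], "issues": []})
    for row in logs:
        by_key[row[key]]["pages"].append(row["page"])
    for row in tickets:
        by_key[row[key]]["issues"].append(row["issue"])
    return {k: v for k, v in by_key.items() if v["pages"] and v["issues"]}

print(correlate(web_logs, helpdesk_tickets))
# -> {'10.0.0.1': {'pages': ['/telemetry', '/downloads'],
#                  'issues': ['slow downloads']}}
```

In an RDBMS this relationship would have to exist in the schema before load; here it is discovered at query time, which is the point of the slide.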
Why Big Data?
Growth of Big Data is possible because of:
• Increased low cost storage, processing power
• Availability of different types of data
• Distributed computing
• Virtualization techniques
• Increased sources/devices spewing data (e.g. Internet of Things)
• Increased ability to capture data (e.g. Hubble Space Telescope)
Data contains information of great business value. If we can
extract the insights, we can make far better decisions
All major companies are developing Big Data strategies to
reduce cost, to substantially improve the time required to
perform computing tasks, and to offer new products/services
Businesses and Agencies that don’t take Big Data seriously
will be left behind
Big Data Tools & Technologies
Big Data Landscape
NASA Examples
Langley Research Center’s
Deep Content Analytics and Deep Q&A
Deep Content Analytics: using Watson Content Analytics Technology
Goal is to provide the community with natural language processing technologies that
will quickly make sense of internal/global knowledge by identifying trends and
experts, aiding in discovery, and finding answers to questions with evidence
Key Collections Analyzed:
• Space Radiation
• Aerospace Vehicle Design
• Carbon Nanotubes
• Autonomous Flight
• Model Based Engineering
• Human-Machine Teaming
• NASA Lessons Learned
• Uncertainty Quantification
• Select NASA Technical Reports
• Space Mission Analysis
Deep Q&A: Using Watson Discovery Advisor Platform
bigdata.larc.nasa.gov
Knowledge Assistants Using Watson Content Analytics
Key Capabilities
• Digest and analyze thousands of articles without reading
• Rapidly identify groupings, trends, connections, and experts
• Explore technology gaps that could be leveraged
• Identify cross-domain leverage and research
Collections analyzed include Space Radiation, Carbon Nanotubes, and Autonomous Flight Research:
• Analysis of 130,000 articles from a 20-year time span
• Analysis of 4,000 articles integrating scholarly and web content
• Analysis of 1,000 articles of NASA research from the Human Research Program
• Human-Machine Teaming, Uncertainty Quantification, and Vehicle Design being worked
• Successfully demonstrated value and developed robust expertise and capability
• Buying licensed content is a challenge; working with NASA content, open content, and individual researchers’ collections
NASA Earth eXchange (NEX)
Big Data Experiment
500+ members, 2.3 PB of storage
http://nex.nasa.gov
Distributed Active Archive Centers (DAACs)
Archives, processes, documents, and distributes data for all of NASA’s Earth
Science Division missions through 12 science-discipline-focused DAACs
Stores Earth Observing System
(EOS) Data
Feb 2017: Global Imagery Browse
Services (GIBS) team announced
the availability of 80+ new layers
One example: Langley Research
Center’s Atmospheric Science Data
Center (ASDC) supports more than
40 projects and has more than 1800
archived data sets
NASA’S Earth Observing System (EOS)
Earth Science mission data are processed by
Science Investigator-led Processing Systems
designed for missions and measurements
Stewardship of Earth Science Data is conducted by
Distributed Active Archive Centers that provide
knowledgeable curation and science-discipline-based
support
NASA develops tools for users to obtain needed data/information while minimizing burden associated with unwanted data
NASA engages with multiple US and international partners to facilitate use of data by broadest possible community with minimal effort and maximal consistency with other data sources
Open Data Policy (since 1994) - Data are openly available to all and free of charge except where governed by international agreements
Earth Science Data and Information Policy
NASA commits to the full and open sharing of Earth science data obtained from NASA Earth observing satellites, sub-orbital platforms and field campaigns with all users as soon as such data become available.
There will be no period of exclusive access to NASA Earth science data. Following a post-launch checkout period, all data will be made available to the user community. Any variation in access will result solely from user capability, equipment, and connectivity.
NASA will make available all NASA-generated standard products along with the source code for algorithm software, coefficients, and ancillary data used to generate these products.
All NASA Earth science missions, projects, and grants and cooperative agreements shall include data management plans to facilitate the implementation of these data principles.
NASA will enforce a principle of non-discriminatory data access so that all users will be treated equally. For data products supplied from an international partner or another agency, NASA will restrict access only to the extent required by the appropriate Memorandum of Understanding (MOU).
http://science.nasa.gov/earth-science/earth-science-data/data-information-policy/
Example of Data Stored in the DAACs
Image captured on 29 January 2017 by the VIIRS instrument, on board the joint NASA/NOAA SNPP satellite.
NASA Examples
Counting trees across the United States
Mission Impact: First ever high resolution forest cover
dataset. Reduces uncertainties in forest carbon
estimation in support of the NASA Carbon Monitoring
System.
NEX production pipeline example: In this high-resolution image, the NEX team used a
combination of machine-learning and computer vision algorithms to identify urban
tree cover over San Francisco and the surrounding area using data from the National
Agricultural Imaging Program aerial imagery project. This map serves as a critical
input for biomass estimations and a number of eco-physiological models
POCs: Saikat Basu, Louisiana State University
Sangram Ganguly, NASA Ames Research Center
Application of machine learning to
Treecover/Landcover classification
from very high-resolution data (1-m
from NAIP)
Combination of feature extraction,
normalization and Deep Belief Network
Input size – 70TB per epoch
Feature vectors – 3PB
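As a rough illustration of the pipeline named above (feature extraction, then normalization, then classification), here is a minimal stdlib-Python sketch. A nearest-centroid classifier stands in for the Deep Belief Network, and the pixel patches are invented, so this shows only the shape of the workflow, not the NEX implementation.

```python
# Simplified stand-in for the tree-cover classification steps:
# per-patch feature extraction -> min-max normalization -> classifier.
import math

def extract_features(patch):
    # Toy per-patch features: mean brightness and brightness spread.
    return [sum(patch) / len(patch), max(patch) - min(patch)]

def normalize(vectors):
    # Min-max scale each feature dimension to [0, 1].
    dims = list(zip(*vectors))
    lo, hi = [min(d) for d in dims], [max(d) for d in dims]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(vec, lo, hi)] for vec in vectors]

def nearest_centroid(train, labels, sample):
    # Average the training vectors per class, pick the closest class.
    groups = {}
    for vec, lab in zip(train, labels):
        groups.setdefault(lab, []).append(vec)
    cents = {lab: [sum(c) / len(c) for c in zip(*vs)]
             for lab, vs in groups.items()}
    return min(cents, key=lambda lab: math.dist(cents[lab], sample))

# Invented pixel patches: dark vegetated patches vs bright urban ones.
patches = [[40, 42, 44], [200, 210, 190], [45, 50, 43], [205, 220, 215]]
labels = ["tree", "urban", "tree", "urban"]
query_patch = [41, 47, 44]

feats = normalize([extract_features(p) for p in patches + [query_patch]])
train, query = feats[:-1], feats[-1]
print(nearest_centroid(train, labels, query))  # -> tree
```

The real pipeline replaces each of these toy stages with heavy machinery (70 TB of input per epoch, 3 PB of feature vectors, and a Deep Belief Network), but the extract/normalize/classify sequence is the same.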
Small Data Turns Big
What problem or problems are we solving?
» Structured Data (Financials)
• Labor cost
• Financial cost
• Duration vs cost
• Labor count vs cost
• External Labor vs duration
» Unstructured Data & Structured (Technical)
• Assessment similarities by technical data
• Findings
• Observations
• Recommendations
• Lessons Learned
Financial with Technical Engineering Data
Who needs the problem solved?
» Management
» Cost analysts
» Engineers
Extract, Transform, and Load (ETL)
[Diagram: NESC Assessments – financial (structured) and technical (unstructured) data are transformed, integrated, and analyzed with text analytics.]
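The ETL flow above can be sketched in a few lines of Python: structured financial rows and unstructured technical notes are extracted, transformed into typed rows and sentences, and loaded into one database for integrated queries. The schema, columns, and sample data are hypothetical, not the NESC database.

```python
# Minimal ETL sketch: extract structured and unstructured inputs,
# transform them into a common shape, load both into one database.
import csv, io, sqlite3

financial_csv = io.StringIO(
    "assessment_id,labor_cost,duration_months\n"
    "A-001,120000,6\n"
    "A-002,250000,14\n"
)
technical_notes = {
    "A-001": "Finding: weld margin low. Recommendation: retest joint.",
    "A-002": "Observation: telemetry dropouts during ascent.",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE financial (id TEXT, labor_cost REAL, duration_months INTEGER)")
conn.execute("CREATE TABLE technical (id TEXT, sentence TEXT)")

# Extract + transform structured data: cast types from CSV strings.
for row in csv.DictReader(financial_csv):
    conn.execute("INSERT INTO financial VALUES (?, ?, ?)",
                 (row["assessment_id"], float(row["labor_cost"]),
                  int(row["duration_months"])))

# Extract + transform unstructured data: split notes into sentences
# so text analytics can run row-by-row alongside the financials.
for aid, note in technical_notes.items():
    for sentence in filter(None, (s.strip() for s in note.split("."))):
        conn.execute("INSERT INTO technical VALUES (?, ?)", (aid, sentence))

# Integrated view: cost context joined to each technical statement.
rows = conn.execute(
    "SELECT f.id, f.labor_cost, t.sentence FROM financial f "
    "JOIN technical t ON t.id = f.id ORDER BY f.id").fetchall()
for r in rows:
    print(r)
```

The join at the end is the payoff: once both data types share a key, cost questions and technical-finding questions can be answered from the same query.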
Small Data Turns Into Big Data
Transformed unstructured and structured data from about 1,000 assessments
yield more than 1,000,000 rows of data for textual analytics:
» Assessment Findings
» Observations
» Recommendations
» Lessons Learned
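A toy example of how textual analytics fans a small set of assessments out into many rows: each document section is tokenized into one row per term, which is how roughly 1,000 assessments can yield over a million rows. The tokenizer and sample text below are illustrative only.

```python
# Sketch of "small data turns big": each finding/observation/
# recommendation becomes one row per extracted term (synthetic data).
import re

def to_term_rows(assessment_id, section, text):
    """One row per (assessment, section, term) - the unit that
    textual analytics works on."""
    terms = re.findall(r"[a-z]+", text.lower())
    return [(assessment_id, section, t) for t in terms]

rows = []
for aid in range(3):  # the real corpus had ~1,000 assessments
    rows += to_term_rows(aid, "finding", "Frangible joint margin below spec")
    rows += to_term_rows(aid, "recommendation", "Retest joint with updated fixture")

print(len(rows), rows[0])  # -> 30 (0, 'finding', 'frangible')
```

Ten term-rows per short assessment here; real assessments run to pages, so the thousandfold blowup in the slide follows directly.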
Visual Analytics
Frangible Joint Task
The data complexity and output format required some basic data-to-database
translation to improve quality control
» Standardize units of measure and data field names, and make data
corrections/updates as required
» The Database team systematically reviewed and coordinated specific
terminology with the FJ Assessment Test team
» The Database team and Test teams worked closely, and this collaboration
led to a uniform process for building the database.
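The standardization step described above might look something like this in Python: canonical field names and unit conversions are applied before a record enters the database. The field aliases and conversion factor shown are assumptions for illustration, not the actual FJ data dictionary.

```python
# Hedged sketch of the quality-control step: standardize field names
# and units of measure before rows enter the database. Aliases and
# conversion factors below are illustrative, not the FJ standard.
FIELD_ALIASES = {
    "max_load_lbf": "peak_load_lbf", "peakload": "peak_load_lbf",
    "temp_f": "temperature_f", "temperature": "temperature_f",
}
UNIT_CONVERSIONS = {
    "peak_load_n": ("peak_load_lbf", 0.2248),  # newtons -> pounds-force
}

def standardize(record):
    out = {}
    for name, value in record.items():
        key = name.lower().replace(" ", "_")
        if key in UNIT_CONVERSIONS:            # convert units first
            key, factor = UNIT_CONVERSIONS[key]
            value = round(value * factor, 2)
        out[FIELD_ALIASES.get(key, key)] = value  # then canonical name
    return out

print(standardize({"PeakLoad": 910, "Temp F": 72}))
# -> {'peak_load_lbf': 910, 'temperature_f': 72}
print(standardize({"peak_load_N": 4048}))
# -> {'peak_load_lbf': 909.99}
```

Centralizing the aliases and conversions in one place is what made the team's process uniform: every record passes through the same translation regardless of which test run produced it.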
Frangible Joint Task
Visual Basic was used to create application macros to extract and prepare the
data, lessening process time and improving repeatability.
The users were provided an interface to query the database.
The data can be exported in various formats to a spreadsheet or other
analysis tools.
The Database team also provided specific reporting formats, per user
requirements.
» There were some limits on how much of the 113,000,000 records could be
downloaded at one time because of user computer and/or software limitations.
For example, Excel can display only about 1,000,000 rows of data on a 64-bit
operating system, and far fewer on a 32-bit operating system.
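One common workaround for such row limits is to export query results in chunks, each small enough for a spreadsheet to open. The sketch below uses an in-memory SQLite table standing in for the assessment database; the table name and sizes are illustrative.

```python
# Sketch: export a large table in chunks of at most `chunk` rows so
# each piece stays within a spreadsheet's row capacity.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_records (id INTEGER, value REAL)")
conn.executemany("INSERT INTO test_records VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(2500)])

def export_chunks(conn, table, chunk=1000):
    """Yield successive lists of rows, never more than `chunk` long."""
    offset = 0
    while True:
        rows = conn.execute(
            f"SELECT * FROM {table} LIMIT ? OFFSET ?",
            (chunk, offset)).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk

chunks = list(export_chunks(conn, "test_records"))
print([len(c) for c in chunks])  # -> [1000, 1000, 500]
```

With the real 113,000,000-record table the same loop would simply run for more iterations, writing each chunk to its own file.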
Frangible Joint Task, Main Report Page
Frangible Joint Task, Report Example
Backup
Introduction to Big Data
(aka: Big Data 101)
John D. Sprague
Acting, Associate CIO,
Technology & Innovation Division
10/07/2016
Big Data Return On Investment
A few early results:
» Visa sped up crunching 36TB of data from one month to 13 minutes with
Hadoop in the cloud
» Google’s self-learning voice recognition via probability calculations reduced
errors by up to 25 percent
» Nestle saves $1B annually by analyzing big data
» Best Buy found that 7% of its customers accounted for 43% of sales and
reorganized the stores around them
» Wal-Mart organizes in real-time from 2.5 PB databases
» JPL sped up processing of data from two weeks to four hours by using the
cloud
Big Data Tools & Technologies
Big Data Landscape
NASA Open Gov
Increase Agency transparency
and accountability to external
stakeholders
Enable citizen participation in
NASA missions (prizes and
challenges, citizen science)
Improve internal NASA
collaboration and innovation
Encourage partnerships that can create economic opportunity
Institutionalize Open Gov philosophies and practices at NASA
Located: https://open.nasa.gov/open-gov/
NASA's Pleiades Supercomputer
State-of-the-art tech for meeting the agency's supercomputing requirements,
enabling scientists and engineers to conduct mission modeling and simulation.
5.95 Pflop/s LINPACK rating (#13 in the world as of November 2016).
NASA’s Interface Region Imaging Spectrograph (IRIS) spacecraft flying above
the solar surface at 6,000 miles, showing only light emitted by
plasma at a temperature of about 35,000 degrees Fahrenheit.
https://www.nas.nasa.gov/hecc/resources/pleiades.html
Predictive Modeling for NASA’s Entry,
Descent, and Landing (EDL) Missions.
Image Credit: Mats Carlsson,
University of Oslo
Unmanned Missions
Hubble Space Telescope, deployed in April 1990
First major optical telescope to be placed in space
More than 1.3 M observations, 14K scientific papers
Hubble archive contains more than 120 Terabytes,
and Hubble science data processing generates about
10 Terabytes of new archive data per year
James Webb Space Telescope (JWST)
» Space-based observatory, optimized for infrared
wavelengths, which will complement and extend
the discoveries of the Hubble. Launches in 2018
» 6 times larger in area than Hubble
Moon Laser
(Lunar Laser Communication Demonstration)
Pulsed laser beam set record in data transmission
» Transmitted data over 239,000 miles between moon
and Earth, record-breaking download rate of 622 Mbps
Developed by MIT’s Lincoln Laboratory, Hosted on LADEE
Prep for Laser Communications Relay Demo (LCRD)
» Transmit data 10 to 100 times faster than the fastest
radio-frequency systems, Uses significantly less mass/power
» Being built at NASA’s Goddard Space Flight Center
» First time NASA has contracted to fly a payload on an American-manufactured
commercial communications satellite, launching in mid-2019
NASA Examples – Extra Vehicular Activity
(EVA) Data Integration
Just Completed Phase 2
State-of-the-art Platform using Tools:
• Core platform based on ELK Stack
• Prelert for Anomaly Detection
• Hosted in AWS Government Cloud
• Various databases including
RethinkDB, Neo4j, Amazon RDS, &
PostgreSQL
• APIs deployed as microservices
using Docker containers
• Amazon S3 for document storage
• Integrated with jBPM workflow
engine
• UI framework: HTML5, Angular,
Bootstrap, Materialize CSS
• 2D and 3D models for browsing
Access to all EVA data – Search, Browse, Navigate & Analyze
Vehicle Exploration Medical System
Ground Based & Vehicle Data Architectures:
• Clinical Operational Needs
• Research Data Capture
• Long Term Health Information
• Crew Medical Officer
• Crew Medical Support
• Flight Surgeon/BME
• External Consults
Real-Time Data Processing &
Analytics for Crew
Mirrored Delayed Data Presentation
for situational awareness/support
Exploration Medical Capabilities (ExMC)
Data Analytics
Technologies:
• Apache Spark
• Kafka for messaging
• D3 Visualization
• cFS (Core Flight
System)
• TimeSeries and
RocksDB databases
• Machine Learning, Speech
Recognition, and Natural
Language Processing
• Processing sensor data,
including bio and
environmental sensors
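As a minimal stand-in for the real-time analytics layer above, here is a rolling z-score check over a simulated biomedical sensor stream that flags readings far from the recent mean. This stdlib sketch only illustrates the idea; the actual ExMC pipeline uses Spark, Kafka, and machine learning as listed, and the threshold and data here are made up.

```python
# Stand-in for streaming anomaly detection on crew sensor data:
# flag readings more than `threshold` std devs from a rolling mean.
from collections import deque
import statistics

def detect_anomalies(stream, window=10, threshold=3.0):
    """Return (index, value) pairs that deviate from the rolling window."""
    recent = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(stream):
        if len(recent) == recent.maxlen:
            mean = statistics.mean(recent)
            sd = statistics.pstdev(recent)
            if sd > 0 and abs(x - mean) / sd > threshold:
                flagged.append((i, x))
        recent.append(x)
    return flagged

# Steady heart-rate signal with one spurious spike injected.
signal = [72, 71, 73, 72, 74, 73, 72, 71, 73, 72, 180, 72, 73]
print(detect_anomalies(signal))  # -> [(10, 180)]
```

In a flight system the same per-reading check would sit inside a stream processor consuming sensor topics, with flags routed to the delayed-data presentation for situational awareness.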
Other Resources
Get involved
• Join the NASA Big Data Working Group – meets monthly, 2nd Thursday of each month
– Notes are captured on the http://niam.nasa.gov/big-data-2/ website (only accessible from the NASA corporate
network or VPN)
• Big Data Face to Face Work Sessions (1 or 2 a year, rotating Centers)
Training
• MIT’s Tackling the Challenges of Big Data online course is considered the gold
standard
• In the works: an advanced Big Data SATERN course
Apache
• Various Open Source Big Data tools by Apache
Massive Open Online Courses (MOOCs)
• Coursera, Udacity, Udemy, Big Data University, and edX