NIST Big Data Public Working Group
Definition and Taxonomy Subgroup Presentation
September 30, 2013
Nancy Grady, SAIC
Natasha Balac, SDSC
Eugene Lister, R2AD
Definition and Taxonomy 9/30/13
Overview
• Objectives
• Approach
• Big Data Component Definitions
• Data Science Component Definitions
• Taxonomy
  – Roles
  – Activities
  – Components
  – Subcomponents
• Templates
• Next Steps
Objectives
• Identify concepts
• Focus on what is new and different
• Clarify terminology
• Attempt to avoid terms that have domain-specific meanings
• Remain independent of specific implementations
Approach
• Hold scope to what is different because of Big Data
  – Use additional concepts needed for completeness
• Restrict terms to represent single concepts
• Don’t stray too far from common usage
• In the report go straight to Big Data and Data Science
  – This presentation will start from more elemental concepts
• Related to cloud, but cloud is not required
Definitions
Big Data
Data Science
Concepts Relating to Data
• Data Type (structured, semi-structured, unstructured)
  – Beyond our scope (and not new)
• Data Lifecycle
  – Raw Data
  – Usable Information
  – Synthesized Knowledge
  – Implemented Benefit
• Metadata: data about data or system or processing
  – Provenance: Data Lifecycle history
• Complexity: dependent relationships across data elements
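One way to picture provenance as "Data Lifecycle history" is a record that moves through the four lifecycle stages and logs each transition. This is only a minimal sketch: the stage names come from the slide, while the `Dataset` class, its `advance` method, and the example actions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Stage names follow the slide; everything else here is a hypothetical illustration.
STAGES = ["raw data", "usable information", "synthesized knowledge", "implemented benefit"]

@dataclass
class Dataset:
    name: str
    stage: str = STAGES[0]
    provenance: List[str] = field(default_factory=list)  # lifecycle history (metadata)

    def advance(self, action: str) -> None:
        """Move to the next lifecycle stage and record how the data got there."""
        nxt = STAGES.index(self.stage) + 1
        if nxt >= len(STAGES):
            raise ValueError("already at the final lifecycle stage")
        self.provenance.append(f"{self.stage} -> {STAGES[nxt]}: {action}")
        self.stage = STAGES[nxt]

ds = Dataset("sensor_log")
ds.advance("cleansed and joined")
ds.advance("fitted trend model")
print(ds.stage)
print(ds.provenance)
```

The provenance list is metadata in the slide's sense: it describes the processing applied to the data rather than the data itself.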
Concepts Relating to Dataset at Rest
• Volume: amount of data
• Variety: many data types, and also across data domains
• Persistence: storing in {flat files, RDBMS, NoSQL, markup, …}
• NoSQL
  – Big Table
  – Name-value
  – Graph
  – Document
• Tiered storage {in-memory, cache, SSD, hard disk, …}
• Distributed {local, multiple local, network-based}
Concepts Related to Dataset in Motion
• Velocity: rate of data flow
• Variability: change in rate of data flow, also
  – Structure
  – Refresh rate
• Accessibility: new concept of Data-as-a-Service
• Transport formats (not new)
• Transport protocols (not new)
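To make the distinction between velocity and variability concrete, the sketch below measures both from a list of arrival timestamps. The choice of fixed one-second windows, the mean as the velocity figure, and the population standard deviation as the variability figure are assumptions made for illustration; the slides do not prescribe any particular metric.

```python
from statistics import mean, pstdev

def rates_per_window(timestamps, window):
    """Count events in consecutive windows of `window` seconds; return events per second."""
    if not timestamps:
        return []
    start = min(timestamps)
    n_windows = int((max(timestamps) - start) // window) + 1
    counts = [0] * n_windows
    for t in timestamps:
        counts[int((t - start) // window)] += 1
    return [c / window for c in counts]

# Hypothetical arrival times in seconds: a steady trickle followed by a burst.
arrivals = [0.0, 0.5, 1.0, 1.5, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7]

rates = rates_per_window(arrivals, window=1.0)
velocity = mean(rates)        # average rate of data flow
variability = pstdev(rates)   # how much that rate changes between windows
print(rates, velocity, variability)
```

A stream with the same average velocity but no burst would report a variability of zero, which is the difference the two terms are meant to capture.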
Big Data Analogy to Parallel computing
• Processor improvements slowed
• Coordinate a loose collection of processors
• Adds resource communication complexities
  – System clocks
  – Message passing
• Distribution of processing code
• Distribution of data for processing nodes
Big Data - Jan 15-17 NIST Cloud/Big Data Workshop
Big Data refers to digital data volume, velocity, and/or variety that:
• Enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods; and/or
• Exceed the storage capacity or analysis capability of current or conventional methods and systems.
• Differentiate by storing and analyzing population data rather than samples.
Refinements are Welcome
• The heart of the change is the scaling
  – Data seek times improving more slowly than Moore’s Law
  – Data volumes increasing faster than Moore’s Law
• Implies the addition of horizontal scaling to vertical scaling
  – Data analogous to MPP processing changes
• Difficult to define as
  – An implication of engineering changes
  – Data Lifecycle process order changes
  – Implication of a new type of analytics
  – As moving the processing to the data not the data to the processing
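The idea of moving the processing to the data can be sketched in a single process. In the example below each inner list stands in for a partition held on a separate node; a small function runs against each partition locally, and only compact summaries are combined centrally. The partition contents and the function names `local_summary` and `combine` are hypothetical, and no real cluster framework is involved.

```python
# Hypothetical partitions, each standing in for data held on a separate node.
partitions = [
    [3, 8, 1],
    [7, 2],
    [9, 4, 6, 5],
]

def local_summary(chunk):
    """Runs where the data lives; returns a small (count, total) pair."""
    return len(chunk), sum(chunk)

def combine(summaries):
    """Runs on a coordinator; only the small summaries would cross the network."""
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return total / count

summaries = [local_summary(p) for p in partitions]
global_mean = combine(summaries)
print(summaries, global_mean)
```

Adding a node means adding a partition and one more summary, which is the horizontal scaling the slide contrasts with making a single machine larger.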
Big Data Analytics Characteristics
Analytics Characteristics are not new
• Value: produced when the analytics output is put into action
• Veracity: measure of accuracy and timeliness
• Quality:
  – Well-formed data
  – Missing values
  – Cleanliness
• Latency: time between measurement and availability
• Data types have differing pre-analytics needs
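Two of these characteristics lend themselves to direct measurement. The sketch below computes latency as the gap between measurement and availability, and uses the share of missing values as one simple quality indicator. The record layout, field names, and the ten-second threshold are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical records: when a value was measured and when it became available.
records = [
    {"measured": datetime(2013, 9, 30, 12, 0, 0),
     "available": datetime(2013, 9, 30, 12, 0, 5), "value": 21.5},
    {"measured": datetime(2013, 9, 30, 12, 1, 0),
     "available": datetime(2013, 9, 30, 12, 1, 15), "value": None},
]

def latency(record):
    """Latency as defined on the slide: availability time minus measurement time."""
    return record["available"] - record["measured"]

def missing_fraction(rows, field_name):
    """One simple quality measure: the share of rows with no value for a field."""
    return sum(1 for r in rows if r[field_name] is None) / len(rows)

SLOW = timedelta(seconds=10)  # assumed threshold, for illustration only
slow_records = [r for r in records if latency(r) > SLOW]

print(latency(records[0]), missing_fraction(records, "value"), len(slow_records))
```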
Data Science as a Science Progression
Termed the “Fourth Paradigm” by the late Jim Gray
• Experiment: Empirical measurement science
• Theory: Causal interpretation
  – Explains experiments
  – Calculates measurements that would confirm the theoretical models
• Simulation: Performing theory (model)-driven experiments that are not empirically possible
• Data Science: Empirical analysis of data produced by processes
Data Science Analogy (simplistically)
• Statistics
  – Precise deterministic causal analysis
  – Over precisely collected data
• Data Mining
  – Deterministic causal analysis
  – Over re-purposed data that has been carefully sampled
• Data Science
  – Trending or correlation analysis
  – Over existing data that typically uses the bulk of the population
Data Science
• Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and analytical hypothesis analysis.
• A Data Scientist is a practitioner who has sufficient knowledge of the overlapping regimes of expertise in business needs, domain knowledge, analytical skills and programming expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle (through action) to deliver value.
Data Science Skillsets
Data Science Addendums
• Is not just Analytics
• The end-to-end data system is the equipment
• The analytics over Big Data can be
  – Exploratory or discovery-driven for hypothesis generation
  – Focused hypothesis verification
  – Focused on operationalization
Taxonomy
Actors
Roles
Activities
Components
Subcomponents
Big Data Taxonomy
• Actors
• Roles
• Activities
• Components
• Sub-components
Actors
• Sensors
• Applications
• Software agents
• Individuals
• Organizations
• Hardware resources
• Service abstractions
System Roles
• Data Provider – makes available data internal and/or external to the system
• Data Consumer – uses the output of the system
• System Orchestrator – governance, requirements, monitoring
• Big Data Application Provider – instantiates application
• Big Data Framework Provider – provides resources
Roles and Actors
Data Provider
System Orchestrator
Big Data Application Provider
Big Data Framework Provider
Data Consumer
Big Data Security
Big Data Application Provider
Data Lifecycle Processes
[Figure: data lifecycle diagram. Labels: Need, Goal, Collect, Curate, Analyze, Act & Monitor, Evaluate; Data, Information, Knowledge, Benefit.]
Data Warehouse Template – store after curate
[Figure: pipeline across the stages COLLECT, CURATE, ANALYZE, ACT. Labels: Domain, Staging, Cleanse, Transform, ETL, Warehouse, Summarized Data, Algorithm, Analytic, Mart, Action.]
ETL = extract, transform, load
Volume template – store raw data after collect
[Figure: pipeline across the stages COLLECT, CURATE, ANALYZE, ACT. Labels: Raw Data, Cluster, Model Building, Model Analytics, Data Product, Map/Reduce, Mart, Model Data, Volume, Complexity, Domain, Cleanse, Transform, Analyze.]
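The Map/Reduce step labeled in the Volume template can be illustrated with a small in-process word count over raw records. This is a sketch of the programming pattern only, not the API of any particular framework; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (key, 1) pair for every word in one raw record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

# Hypothetical raw records, stored as collected and processed afterwards.
raw = ["big data big", "data science"]

pairs = [kv for record in raw for kv in map_phase(record)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across the nodes of a cluster.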
Velocity Template – store after analytics
[Figure: pipeline across the stages COLLECT, CURATE, ANALYZE, ACT. Labels: Enriched Data Cluster, Velocity, Volume, Alerting, Domain, Cleanse, Transform.]
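The "store after analytics" ordering in this template's title can be sketched with a generator that analyzes each reading as it arrives, raises an alert immediately, and only then persists the enriched record. The function name `enrich_and_alert`, the threshold, and the sample readings are assumptions for illustration; a real deployment would use a streaming platform rather than an in-memory list.

```python
def enrich_and_alert(stream, threshold, store):
    """Analyze each reading on arrival, alert at once, and store afterwards."""
    for reading in stream:
        enriched = dict(reading, over_limit=reading["value"] > threshold)
        if enriched["over_limit"]:
            yield f"ALERT {reading['sensor']}: {reading['value']}"
        store.append(enriched)  # persistence happens after the analytics step

# Hypothetical sensor readings arriving in order.
stream = iter([
    {"sensor": "a", "value": 42},
    {"sensor": "b", "value": 130},
    {"sensor": "a", "value": 99},
])

store = []
alerts = list(enrich_and_alert(stream, threshold=100, store=store))
print(alerts, len(store))
```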
Variety Template – Schema-on-Read
[Figure: pipeline across the stages COLLECT, CURATE, ANALYZE, ACT. Labels: Analyze, Common Query, Fused Data, Variety, Complexity, Map/Reduce, Query.]
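Schema-on-read, named in this template's title, means records are stored as they arrive and a structure is imposed only when they are queried. The sketch below keeps heterogeneous JSON lines untouched and normalizes a temperature field at read time. The field names, the Fahrenheit suffix convention, and the `read_temp_c` helper are hypothetical.

```python
import json

# Hypothetical raw store: records kept exactly as they arrived, with no common schema.
raw_store = [
    '{"user": "ann", "temp_c": 21.5}',
    '{"user": "bob", "temperature": "70F"}',
    '{"user": "cy"}',
]

def read_temp_c(line):
    """Apply a schema at read time: return the temperature in Celsius, or None."""
    doc = json.loads(line)
    if "temp_c" in doc:
        return float(doc["temp_c"])
    raw = doc.get("temperature")
    if isinstance(raw, str) and raw.endswith("F"):
        return round((float(raw[:-1]) - 32) * 5 / 9, 1)
    return None

temps = {json.loads(line)["user"]: read_temp_c(line) for line in raw_store}
print(temps)
```

A new source with yet another field layout needs only a change to the read function; nothing already stored has to be rewritten, which is what lets this template absorb variety.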
Analysis to Action Template
• Seconds – Streaming Real-time Analytics
• Minutes – Batch jobs of operational model
• Hours – Ad-hoc analysis
• Months – Exploratory analysis
Possible Next Steps
• Refinement of the Big Data Definition
• Word-smithing of all definitions
• Refinement of the Taxonomy Mindmap for completeness
• Exploration of Templates for categorization
• Data distribution templates according to CAP compliance
• Measures and Metrics (how big is Big Data)