Getting Rid of Data
Tova Milo
Tel Aviv University
The Big Data Era
From sports, to health care, to the way we drive our cars or choose how to invest our money… Big Data is changing every aspect of our lives.
Tova Milo GETTING RID OF DATA - VLDB’19 2
The Big Data Era
The data-centered revolution is fueled by masses of data, but at the same time it is at great risk from that very same flood of information.
The Big Data Era
Time to stop and rethink the “More Data!” philosophy.
The 3 P’s to worry about:
Production
Privacy
Performance
Production of Data & Storage
The size of our digital universe grows exponentially
Forecast [IDC’17]:
“By 2025 the global datasphere will grow to 163 zettabytes (trillion giga), ten times the 16.1 ZB of data generated in 2016.”
Updated forecast [IDC’18]:
“By 2025 the global datasphere will grow to 175 zettabytes, from the 33 ZB in 2018”
Storage demand is estimated to outstrip production by more than double!
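The two IDC forecasts quoted above each imply a compound annual growth rate, which makes the "grows exponentially" claim concrete. The following is a minimal sketch of that arithmetic; the helper name `annual_growth_rate` is illustrative, and the only inputs are the figures in the quotes.

```python
def annual_growth_rate(start_zb: float, end_zb: float, years: int) -> float:
    """Compound annual growth rate implied by two datasphere sizes."""
    return (end_zb / start_zb) ** (1.0 / years) - 1.0

# IDC'17 forecast: 16.1 ZB (2016) -> 163 ZB (2025)
rate_2017 = annual_growth_rate(16.1, 163.0, 2025 - 2016)
# IDC'18 forecast: 33 ZB (2018) -> 175 ZB (2025)
rate_2018 = annual_growth_rate(33.0, 175.0, 2025 - 2018)

print(f"IDC'17 implied growth: {rate_2017:.1%} per year")
print(f"IDC'18 implied growth: {rate_2018:.1%} per year")
```

Both forecasts work out to somewhere between 25% and 30% growth per year, which corresponds to the datasphere doubling roughly every three years.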
Data Size
How Much is 175 ZB?
“If one were able to store 175ZB onto BluRay discs, then you’d have a stack of discs that can get you to the moon 23 times…”
“Even if you could download 175ZB on today’s largest hard drive it would take 12.5 billion drives (and as an industry, we ship a fraction of that today.)”
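Both quotes can be sanity-checked with a few lines of arithmetic. This is only a sketch: the disc thickness (1.2 mm) and the mean Earth-Moon distance (about 384,400 km) are assumptions brought in for the check, not figures from the talk.

```python
ZB = 10**21  # bytes in a zettabyte (decimal SI prefix)
TB = 10**12
GB = 10**9

total_bytes = 175 * ZB

# Hard drives: 12.5 billion drives, as stated in the quote.
bytes_per_drive = total_bytes / 12.5e9
print(f"Implied drive capacity: {bytes_per_drive / TB:.0f} TB")

# Blu-ray stack. Assumptions (not from the talk): a disc is 1.2 mm thick
# and the mean Earth-Moon distance is about 384,400 km.
disc_thickness_m = 1.2e-3
moon_distance_m = 384_400e3
discs_in_stack = 23 * moon_distance_m / disc_thickness_m
bytes_per_disc = total_bytes / discs_in_stack
print(f"Implied disc capacity: {bytes_per_disc / GB:.1f} GB")
```

Under these assumptions the quotes are internally consistent: they imply drives of about 14 TB and discs close to the 25 GB of a single-layer Blu-ray.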
Storage Production
Data vs. Storage
5 ZB
Performance
Handling exponentially growing data incurs a substantial maintenance and processing overhead
• data cleaning,
• validation,
• enhancement,
• analysis,…
Selective data management is key to performance!
Let’s Think Energy…
Energy Optimization?
Over the last few years:
Development of better ways to cool data centers
Recycling the waste heat
Streamlining computing processes
Switching to renewable energy
Still, even in the best-case predictions, if we don’t learn how to dispose of data, we’ll stay at the same consumption level (which is already high)
Privacy and Security
Even if we disregard storage and performance constraints, uncontrolled data retention endangers privacy & security
EU General Data Protection Regulation (GDPR).
Sarbanes-Oxley, Gramm-Leach-Bliley, the Fair and Accurate Credit Transactions Act, HIPAA,…
Data disposal/retention policies must be systematically developed and enforced to benefit and protect organizations and individuals.
Before we continue, 4 important notes
1) Not all data is important!
2) People fear losing potentially important data
3) Already now, sometimes there is really no choice
4) Like most good ideas, we are not the first to think about this…
Martin Kersten, "The Wildest Idea" Award, CIDR’15 Gong Show, for "Big Data Space Fungus"
Big Data Space Fungus
[CIDR’15]
The Data Disposal Challenge
Retaining the knowledge hidden in the data while respecting storage, processing and regulatory constraints
Determine an optimal disposal policy (which data to retain, summarize, or dispose of) and execute it efficiently
Support full-cycle information processing over the partial data
Incrementally maintain the partial data as new information comes in
The 7 Criteria for Disposing of Data
What makes a piece of data important?
How does importance change over time?
Which of the data is important?
Which data can (or must) be retained/disposed of? When?
What is the cost of retaining / disposing of the data?
How can data be summarized / disposed of?
How to process the partial data?
The Rest of This Talk
1. Existing tools (and why they are not enough)
2. Understanding the past (provenance)
3. Predicting the future (Deep Reinforcement Learning)
(Very) Incomplete List
Deduplication
Entity resolution
(Semantic) compression & summarization
Relations
Semi-structured (XML, RDF, graph)
Unstructured (text)
Sampling
Approximate Query Processing
Sketching
Streams
Machine Learning
Dimensionality reduction
Clustering
Feature selection
Example 1: Relations
[Jagadish, Ng, Ooi, Tung, ICDE'04]
Back to the late 90’s…
Example 2: Graphs
[Song, Wu, Lin, Dong, Sun, TKDE‘18]
Example 3: Sampling for AQP
Approximate query answers, at a fraction of full execution cost
In query-time sampling, the query is evaluated over samples taken from the database at run time.
For a sharper reduction in response time, draw samples from the data in a pre-processing step
Question 1: Sample also from the data summaries?
Question 2: Use the precomputed samples as data summaries, thereby allowing us to discard some (or all) of the remaining items?
[Chaudhuri, Ding, Kandula, SIGMOD‘17]
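To make the idea of pre-computed samples concrete, here is a minimal sketch of answering a filtered SUM from a uniform sample drawn once in a pre-processing step. It assumes uniform sampling without replacement over a synthetic table; the function names `build_sample` and `approx_sum` and the data are hypothetical and not taken from the cited work.

```python
import random

def build_sample(rows, fraction, seed=0):
    """Pre-processing step: draw a uniform random sample of the rows."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

def approx_sum(sample, total_rows, predicate, value):
    """Estimate SUM(value) over matching rows by scaling up the sample sum."""
    scale = total_rows / len(sample)
    return scale * sum(value(r) for r in sample if predicate(r))

# Synthetic table of (region, amount) rows; purely illustrative data.
rng = random.Random(42)
table = [(rng.choice("ABCD"), rng.uniform(0, 100)) for _ in range(100_000)]

sample = build_sample(table, fraction=0.01)
exact = sum(amount for region, amount in table if region == "A")
estimate = approx_sum(sample, len(table),
                      predicate=lambda r: r[0] == "A",
                      value=lambda r: r[1])
rel_error = abs(estimate - exact) / exact
print(f"exact={exact:.0f} estimate={estimate:.0f} relative error={rel_error:.2%}")
```

Whether the rest of the table could then be discarded, as Question 2 asks, depends on whether this kind of error is acceptable across the whole expected workload; a uniform sample is typically least reliable for small or rare groups.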
Common Objectives
Summary properties
Conciseness
Diversification
Coverage
Accuracy w.r.t. query results
Concrete queries
Query class/workload
Information loss
[Orr, Suciu, Balazinska, VLDB‘17]
But in Practice…
Workloads are far more complex (cleaning, transformation, integration, ML, …)
We need to understand how data is manipulated, summarized, and disposed of throughout the entire workload!
Data Provenance
Tracks computation and reveals the “origin” of results
Many different models with different granularities
Can be a key for performing & understanding data reduction
Provenance by Example
Lineage
Provenance Polynomials
Workflow Provenance
Many Applications
• Results Explanation
• Hypothetical reasoning
• Trust level assessment
• Computation in presence of incomplete/probabilistic info.
• Data reduction [Gershtein, M, Novgorodov, CIKM’19]
• …
But…
Provenance is HUGE
Provenance Reduction
Lossless
Size reduction via expression simplification/factorization
(e.g. using Boolean circuits)
Lossy
Selective provenance
Compression via abstraction
Example: Compression by Abstraction
[Deutch, Moskovitch, Rinetzky SIGMOD’19]
Optimization Problem
• Choose a cut in the ontology that maximizes expressiveness for a target compression ratio
• NP-hard in general
• Polynomial time complexity for a single ontology
• Practically appealing heuristics for the general case
[Figure: expressiveness vs. size]
Learn what may be interesting in a new dataset
Exploratory data analysis (EDA):
The process of examining & investigating a given dataset
Exploratory Data Analysis
EDA is an iterative process:
A user u loads a dataset D into an analysis interface.
The user performs a sequence of actions (e.g. queries): S_u(D) = q_1, q_2, …, q_n
After executing q_i, the user examines the results and decides whether to perform another action, and which one.
The goal:
Understand the nature of the dataset
Discover its properties
Estimate its quality
Figure out what may be interesting in it
Modern analysis platforms (e.g. Splunk, Kibana-ELK, Tableau, …)
EDA agent
Can we teach a machine to generate a coherent, meaningful sequence of exploratory queries?
Deep Reinforcement Learning
DRL works surprisingly well for very difficult tasks:
Play Go
Drive a car
Conduct natural language dialogs
……
Can/Should we use DRL?
PROS:
It requires NO training data OR traces of user activity
Once trained, results can be obtained rather FAST.
CONS:
It is a heavyweight tool that requires lots of computing power.
Currently works mostly on game-like environments
Even when it works, it may just overfit to some odd patterns in the data
The Rest of This Talk
1. Quick recap of standard RL settings
2. Requirements for RL-EDA environment
3. Our framework (ongoing work)
RL Standard Settings
In the (not so simple) Atari environment:
1. Agent observes a “state” from an “environment”
2. Agent selects an “action”
3. Agent receives a “reward”
4. Agent learns (unsupervised) a “policy” that maximizes the mean reward
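The four steps can be sketched as a short loop. This is a deliberately tiny stand-in, not the Atari setting: `ToyEnv` is a hypothetical two-state environment, and the "learning" is a tabular running mean of the immediate reward with epsilon-greedy exploration, so there is no neural network and no long-term credit assignment.

```python
import random

class ToyEnv:
    """Hypothetical two-state environment standing in for an Atari game."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Reward 1 when the action matches the current state, else 0.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randrange(2)  # next state is drawn at random
        return self.state, reward

def train(steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = ToyEnv(seed + 1)
    q = {(s, a): 0.0 for s in range(2) for a in range(2)}  # mean reward estimates
    n = {key: 0 for key in q}                              # visit counts
    state = env.state                                      # 1. observe a state
    for _ in range(steps):
        if rng.random() < epsilon:                         # 2. select an action
            action = rng.randrange(2)                      #    (explore)
        else:
            action = max(range(2), key=lambda a: q[(state, a)])
        next_state, reward = env.step(action)              # 3. receive a reward
        key = (state, action)
        n[key] += 1                                        # 4. learn: update the
        q[key] += (reward - q[key]) / n[key]               #    running mean reward
        state = next_state
    # Greedy policy: the best-looking action in each state.
    return {s: max(range(2), key=lambda a: q[(s, a)]) for s in range(2)}

policy = train()
print(policy)
```

In this toy setting the learned policy simply picks the action that matches the state; deep RL replaces the table `q` with a neural network so that the same loop scales to large state spaces.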
RL-EDA Settings
Utilizing the RL paradigm for EDA:
1. Agent observes a dataset/results set
2. Agent formulates a query
3. Agent receives reward
4. Agent learns to maximize the reward
Outline for an RL-EDA Framework
1. RL-EDA environment
2. State and action representation
3. Reward Signal
4. Agent NN-Architecture
RL-EDA Environment
RL-EDA environment comprises:
(1) A collection of datasets
(2) Query interface
RL-EDA Episode:
The agent is “given” an arbitrary dataset
The agent performs a “session” (sequence) of N queries.
State Representation
Result displays are often large and complex…
→ Summarize the results display into a numeric vector
Structural features of the data:
Value entropy, # of distinct values, # of Null values
Grouping/Aggregation features:
# of groups, group-size variance, aggregated values, entropy, …
Context:
N previous displays
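As an illustration of turning a results display into a numeric vector, the sketch below computes the structural features named above (value entropy, number of distinct values, number of nulls) per column, plus two grouping features. The function names, the feature order, and the sample display are assumptions made for this example; the talk does not specify the exact encoding.

```python
import math
from collections import Counter

def column_features(values):
    """Structural features of one column: entropy, # distinct, # nulls."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    # Shannon entropy (in bits) of the non-null value distribution.
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values()) if total else 0.0
    return [entropy, float(len(counts)), float(len(values) - total)]

def display_vector(columns, group_sizes=None):
    """Concatenate per-column features with simple grouping features."""
    vec = []
    for values in columns.values():
        vec.extend(column_features(values))
    sizes = group_sizes or []
    n_groups = len(sizes)
    mean = sum(sizes) / n_groups if n_groups else 0.0
    variance = (sum((s - mean) ** 2 for s in sizes) / n_groups
                if n_groups else 0.0)
    vec.extend([float(n_groups), variance])
    return vec

# Hypothetical display with two columns, grouped into two groups.
display = {"protocol": ["tcp", "tcp", "udp", None],
           "port": [80, 443, 53, 80]}
vec = display_vector(display, group_sizes=[2, 1])
print(vec)
```

The context part of the state could then be formed by concatenating this vector with the vectors of the N previous displays, which is one straightforward reading of the slide.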
Action Representation
Parameterized Actions (action type + parameters)
• FILTER(attr, op, term) - selects the data tuples that match a criterion
• GROUP(attr, agg_func, agg_attr) - groups and aggregates the data
• BACK() - allows the agent to backtrack to a previous display
Our Representation
• [action_type, attr, op, term, agg_func, agg_attr]
• Handle filter terms using the frequency of appearances in the display
Issue: large action domain
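A small sketch of the six-slot representation, and of why the action domain grows quickly, might look as follows. The schema, operator list, aggregate list, the `-1` marker for unused slots, and the choice to replace filter terms by one of ten frequency bins are all assumptions made for illustration.

```python
ACTION_TYPES = ["FILTER", "GROUP", "BACK"]
ATTRS = ["protocol", "port", "length"]   # hypothetical schema
OPS = ["==", "!=", "<", ">"]             # hypothetical operators
AGG_FUNCS = ["COUNT", "SUM", "AVG"]      # hypothetical aggregates
TERM_BINS = 10                           # filter terms bucketed by frequency
UNUSED = -1                              # marker for slots an action ignores

def encode(action_type, attr=None, op=None, term_bin=None,
           agg_func=None, agg_attr=None):
    """Encode an action as [action_type, attr, op, term, agg_func, agg_attr]."""
    def idx(options, value):
        return UNUSED if value is None else options.index(value)
    return [ACTION_TYPES.index(action_type), idx(ATTRS, attr), idx(OPS, op),
            UNUSED if term_bin is None else term_bin,
            idx(AGG_FUNCS, agg_func), idx(ATTRS, agg_attr)]

filter_action = encode("FILTER", attr="port", op="==", term_bin=3)
group_action = encode("GROUP", attr="protocol", agg_func="AVG", agg_attr="length")
back_action = encode("BACK")

# Size of the action domain under this toy schema.
n_filter = len(ATTRS) * len(OPS) * TERM_BINS
n_group = len(ATTRS) * len(AGG_FUNCS) * len(ATTRS)
n_actions = n_filter + n_group + 1       # +1 for BACK
print(filter_action, group_action, back_action, n_actions)
```

Even with three attributes this yields 148 distinct actions, and the count grows multiplicatively with the schema, which illustrates the issue noted above.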
Reward Signal
Given a sequence S_D = q_1, q_2, …, q_n of queries performed by the agent on dataset D, how do we determine the reward R(S_D)?
We suggest three major components.
1. Interestingness: Actions that induce interesting result sets should be encouraged
2. Diversity: Actions in the same session should yield diverse results describing different aspects of the dataset
3. Coherency: The session should be understandable to human analysts
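The talk does not say how the three components are combined. One simple possibility, shown here purely as an assumption, is a weighted sum of per-action scores accumulated over the session; the function name `session_reward` and the default weights are hypothetical.

```python
def session_reward(interestingness, diversity, coherency,
                   weights=(1.0, 1.0, 1.0)):
    """Combine per-action component scores into one session reward.

    Each argument is a list with one score per action in the session.
    The weighted-sum form is an assumption, not the authors' formula.
    """
    w_i, w_d, w_c = weights
    per_step = [w_i * i + w_d * d + w_c * c
                for i, d, c in zip(interestingness, diversity, coherency)]
    return sum(per_step)

# Two-action session with made-up component scores.
equal = session_reward([0.5, 0.8], [0.0, 0.4], [1.0, 1.0])
skewed = session_reward([0.5, 0.8], [0.0, 0.4], [1.0, 1.0],
                        weights=(2.0, 1.0, 0.5))
print(equal, skewed)
```

The weights give a knob for trading interestingness against coherency; how to set them would be an empirical question.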
Interestingness
A multitude of interestingness measures have been suggested in previous work.
Each captures a different aspect of interestingness:
• Diversity: measures how much the elements of a data pattern differ from one another
• Peculiarity: measures how anomalous a pattern is compared to the rest of the data patterns
• Conciseness: measures the size of the pattern compared to its coverage
• Novelty: measures how unexpected a data pattern is w.r.t. prior knowledge
![Page 64: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/64.jpg)
Diversity
Goal: encourage the agent to choose actions that yield observations of parts of the data different from those examined so far
Solution: calculate the Euclidean distances between the observation vector of the current results display and the vectors of all previous displays
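The distance computation above can be sketched in Python as follows. The slide does not say how the distances to the previous displays are aggregated into a single score; taking the minimum (the distance to the closest earlier display) is an assumption made here, and `diversity_reward` is a hypothetical name.

```python
import math


def diversity_reward(current, previous):
    """Distance from the current display vector to the closest previous display vector.

    Aggregating with min() is an assumption; the source only says that Euclidean
    distances to all previous displays are calculated.
    """
    if not previous:
        return 0.0  # first display of the session: nothing to compare against
    return min(math.dist(current, p) for p in previous)


# Example with two-dimensional observation vectors.
history = [[3.0, 4.0], [6.0, 8.0]]
print(diversity_reward([0.0, 0.0], history))
```

With the minimum as the aggregate, a display that nearly repeats any earlier display scores close to zero, even if it is far from all the others.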
![Page 65: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/65.jpg)
Coherency
Coherency is assessed using an external classifier:
1. Given the dataset schema & application domain, we use a set of heuristic classification rules composed by domain experts (e.g. “a group-by that is employed on more than 4 attributes is non-coherent”)
2. We then employ Snorkel to build a weak-supervision-based classifier
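The rule-based first step can be sketched in plain Python as a set of labeling rules that either vote or abstain. Only the first rule comes from the slide; the other two rules, the `step` dictionary layout, and all names are hypothetical. The majority vote below is a simple stand-in for Snorkel, which learns how to weight such rules rather than counting them equally.

```python
COHERENT, NON_COHERENT, ABSTAIN = 1, 0, -1


def rule_wide_group_by(step):
    # Rule quoted on the slide: a group-by on more than 4 attributes is non-coherent.
    return NON_COHERENT if len(step["group_attrs"]) > 4 else ABSTAIN


def rule_repeats_previous(step):
    # Hypothetical rule: repeating the previous action verbatim is non-coherent.
    return NON_COHERENT if step["action"] == step["prev_action"] else ABSTAIN


def rule_drills_into_group(step):
    # Hypothetical rule: filtering on an attribute that is currently grouped is coherent.
    action = step["action"]
    if action[0] == "FILTER" and action[1] in step["group_attrs"]:
        return COHERENT
    return ABSTAIN


RULES = [rule_wide_group_by, rule_repeats_previous, rule_drills_into_group]


def weak_label(step, rules=RULES):
    """Majority vote over the non-abstaining rules; ties and no votes yield ABSTAIN."""
    votes = [v for v in (rule(step) for rule in rules) if v != ABSTAIN]
    pos, neg = votes.count(COHERENT), votes.count(NON_COHERENT)
    if pos == neg:
        return ABSTAIN
    return COHERENT if pos > neg else NON_COHERENT


step = {"action": ("GROUP", "a5"), "prev_action": ("GROUP", "a4"),
        "group_attrs": ["a1", "a2", "a3", "a4", "a5"]}
print(weak_label(step))
```

The weak labels produced this way would then serve as training signal for the coherency classifier.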
![Page 66: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/66.jpg)
Outline for an RL-EDA Framework
1. RL-EDA environment
2. State and action representation
3. Reward Signal
4. Agent NN-Architecture
![Page 67: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/67.jpg)
Challenges
Large # of actions (in particular due to the Filter parameter)
Exploration challenges: imbalanced action types (BACK, GROUP, FILTER)
Our solution: parameterized softmax with pre-output layer
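One way to read "parameterized softmax with pre-output layer" is sketched below in Python: the pre-output layer holds one node per action type and per parameter value, and a separate softmax is applied to each segment. This is an illustrative interpretation under stated assumptions, not the architecture from the talk; the segment names and sizes in `SEGMENTS` are hypothetical.

```python
import math
import random

# Hypothetical segment sizes: one pre-output node per action type / parameter value.
SEGMENTS = {"action_type": 3, "attr": 4, "op": 3, "term_bin": 10,
            "agg_func": 3, "agg_attr": 4}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_softmax(pre_output, segments=SEGMENTS):
    """Split the pre-output vector into segments and normalize each one separately."""
    if len(pre_output) != sum(segments.values()):
        raise ValueError("pre-output length does not match the segment sizes")
    probs, start = {}, 0
    for name, size in segments.items():
        probs[name] = softmax(pre_output[start:start + size])
        start += size
    return probs


def sample_action(probs, rng=random):
    """Draw one index per segment from its own distribution."""
    return {name: rng.choices(range(len(p)), weights=p)[0]
            for name, p in probs.items()}


pre_output = [0.0] * sum(SEGMENTS.values())  # stand-in for the network's pre-output layer
print(sample_action(segment_softmax(pre_output), random.Random(0)))
```

In this sketch the layer needs 27 nodes (the sum of the segment sizes) instead of one output node per complete action, and the action type gets its own distribution, separate from the parameter values.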
![Page 68: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/68.jpg)
A few words about experimental evaluation
1. Learning curves and reward
2. Competitors: Greedy, Recommender systems, Human…
3. Measures: BLEU, session similarity
“Turing test”
![Page 69: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/69.jpg)
Time to Conclude…
![Page 70: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/70.jpg)
Time to Conclude…
The Data Disposal Challenge
Determine an optimal disposal policy (which data to retain, summarize, or dispose of) and execute it efficiently
Support full-cycle information processing over the partial data
Incrementally maintain the partial data as new info comes in
Define formally what makes a disposal policy good…
![Page 71: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/71.jpg)
Time to Conclude…
1. Plenty of relevant tools
2. But still very far from a comprehensive solution
3. ML agents: Still a lot to do here!
• Support more data analysis actions
• Adaptive disposal policies based on user interaction
• Consider potential data exploration goals
![Page 72: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By](https://reader031.vdocument.in/reader031/viewer/2022040409/5ec7f4460874e9695b486af6/html5/thumbnails/72.jpg)
Thank You
Ori Bar-El, Naama Boer, Daniel Deutch, Shay Gershtein, Amir Gilad,
Gefen Keinan, Nave Frost, Yuval Moskovitch, Slava Novgorodov, Kathy
Razmadze, Noam Rinetzky, Amit Somech, Brit Youngmann, …