variety with big data volume and research works to copeanalysis in the big data era popular big data...
TRANSCRIPT
![Page 1: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/1.jpg)
Research Works to Cope with Big Data Volume and
Variety
Jiaheng LuUniversity of Helsinki, Finland
![Page 2: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/2.jpg)
Big Data: 4Vs
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
![Page 3: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/3.jpg)
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Hadoop and Spark platform optimization
![Page 4: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/4.jpg)
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Multi-model databases: quantum framework and category theory
![Page 5: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/5.jpg)
Outline
▶ Big data platform optimization (10 mins)▶ Motivation
▶ Two main principles and approaches
▶ Experimental results
▶ Multi-model databases (10 mins)▶ Overview
▶ Quantum framework
▶ Category theory
5
![Page 6: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/6.jpg)
Optimizing parameters in Hadoop and Spark platforms
![Page 7: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/7.jpg)
Analysis in the Big Data Era
▶ Key to success = Timely and Cost-effective analysis
7
Data Analysis Decision making
Massive data Insights Saving and
Revenue
![Page 8: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/8.jpg)
Analysis in the Big Data Era
▶ Popular big data platform:Hadoop and Spark
▶ Burden on users▶ Responsible for provisioning and configuration
▶ Usually lack expertise to tune the system
8
![Page 9: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/9.jpg)
Analysis in the Big Data Era
▶ Popular big data platform:Hadoop and Spark
▶ Burden on users▶ Responsible for provisioning and configuration
▶ Usually lack expertise to tune the system
9
As a data scientist, I do not know how to improve the efficiency of my job?
![Page 10: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/10.jpg)
Analysis in the Big Data Era
▶ Popular big data platform: Hadoop and Spark▶ Burden on users: provision and tuning▶ Effect of system-tuning for jobs
10
Tuned vs. Default
Running time Often 10x
System resource utilization
Often 10x
Others Well tuned jobs may avoid failures like OOM, out of disk, job time out, etc.
Good performance after tuning
![Page 11: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/11.jpg)
Automatic job optimization toolkit
▶ NOT our goal: Change the codes of the system to improve the efficiency
▶ Our goal: Configure the parameters to achieve good performance
11
Our system is easy to be used in the existing Hadoop and Spark system
![Page 12: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/12.jpg)
▶ Given a MapReduce or Spark job with input data and running cluster, we find the setting of parameters that optimize the execution time of the job. (i.e. minimize the job execution time)
Problem definition
12
![Page 13: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/13.jpg)
Challenges: too many parameters!
13
There are more than 190 parameters in Hadoop!
![Page 14: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/14.jpg)
Two key ideas in job optimizer
14
1.Reduce search space!
190 413
![Page 15: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/15.jpg)
13 parameters we tune
15
![Page 16: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/16.jpg)
16
Four key factors
▶ We identify four key factors (parameters) to model a MapReduce job execution ▶ The number of Map task waves m. (number of Map tasks)
▶ The number of Reduce task waves r. (number of Reduce tasks)
▶ The Map output compression option c. (true or false)
▶ The copy speed in the Shuffle phase v (number of parallel copiers)
![Page 17: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/17.jpg)
17
Cost model ▶ Producer: the time to produce the Map outputs in m
waves
▶ Transporter: the non-overlapped time to transport Map outputs to the Reduce side
▶ Consumer: the time to produce Reduce outputs in r waves
![Page 18: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/18.jpg)
Two key ideas in job optimizer
18
2. Keep everything busy!▶ CPU: map, reduce and compression▶ I/O: sort and merge▶ Network: shuffle
![Page 19: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/19.jpg)
19
Keep map and shuffle parallel
![Page 20: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/20.jpg)
MRTunner approach:
20
Given a new job for tuning
Retrieve the profile data
Search for the four key parameters
Compute the 13 parameters
Configure and run Good
performance after tuning
![Page 21: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/21.jpg)
Architecture:
21
Job Optimizer
Profile query engine
Job Profile
DataProfile
ResourceProfile
Hadoop MR Log HDFS OS/Ganglia
Offline
Online
![Page 22: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/22.jpg)
Profile data▶ Job profile
▶ Selectivity of Map input/output
▶ Selectivity of Reduce input/output
▶ Ratio of Map output compression
▶ Data profile
▶ Data Size
▶ Distribution of input key/value
▶ System profile
▶ Number of machines
▶ Network throughput
▶ Compression/Decompression throughput
▶ Overhead to create a map or reduce task22
![Page 23: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/23.jpg)
23
Experimental evaluation
▶ Performance Comparison ▶ Hadoop-X (Commercial Hadoop): ▶ Starfish: Parameters advised by Starfish▶ MRTuner: Parameters advised by MRTuner
▶ Workloads▶ Terasort ▶ N-gram ▶ Pagerank
![Page 24: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/24.jpg)
24
Effectiveness of MRTuner Job Optimizer
Running time of jobs
For N-gram job, MRTuner obtains more than 20x speedup than Hadoop-X
Commercial Hadoop-X
![Page 25: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/25.jpg)
25
Comparison between Hadoop-X and MRTuner (N-gram)
Hadoop-X MRTuner
Cluster-wide Resource Usage from Ganglia
CPU and Network utilizations are higher.
![Page 26: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/26.jpg)
26
Impact of Parameters on Selected Jobs
MRTuner tuned time: t1Results after changing parameters to Hadoop-X setting: t2The impact: (t2-t1)/t1, then normalize all the impacts
Compression is important
Map task # is important
Reduce task # is important
![Page 27: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/27.jpg)
Ongoing research topics
▶ Efficient job optimization on YARN and Spark
▶ Tune for container size and executor size
27
![Page 28: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/28.jpg)
Multi-model databases: Quantum framework and category theory
![Page 29: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/29.jpg)
Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
Multi-model databases
![Page 30: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/30.jpg)
Motivation: one application to include multi-model data
An E-commerce example with multi-model data
Sale history
RecommendCustomer
ShoppingCart
Product Catalog
![Page 31: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/31.jpg)
NoSQL database types
Photo downloaded from: http://www.vikramtakkar.com/2015/12/nosql-types-of-nosql-database-part-2.html
![Page 32: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/32.jpg)
Multiple NoSQL databases
Sales Social media Customer
CatalogShopping-cart
MongoDB
MongoDBRedis
MongoDBNeo4j
![Page 33: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/33.jpg)
Multi-model DB
Tabular
RDFXML
Spatial
TextMulti-model DB JSON
• One unified database for multi-model data
![Page 34: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/34.jpg)
Challenge: a new theory foundation
Call for a unified model and theory for multi-model data!
The theory of relations (150 years old) is not adequate to mathematically describe modern (NoSQL) DBMS.
![Page 35: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/35.jpg)
Two possible theory foundations
Quantum framework; Approximate query processing for open field in multi-model databases
Category theory: Exact query processing and schema mapping for close field in multi-model databases
35
![Page 36: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/36.jpg)
Quantum framework
Database is based on the components:Logic (SQL expressions)Algebra (relational algebra)’
The Quantum framework adds quantum probability and quantum algebra.
36
![Page 37: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/37.jpg)
Why not classical probability
Apply three rules on multi-model dataQuantum superpositionQuantum entanglementQuantum inference
37
![Page 38: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/38.jpg)
Unifying multi-model data in Hilbert space
38
XMLRelation
RDF Graph
Use quantum probability to answer the query approximately
![Page 39: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/39.jpg)
Two possible theory foundations
Quantum framework; Approximate query processing for open field in multi-model databases
Category theory: Exact query processing and schema mapping for close field in multi-model databases
39
![Page 40: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/40.jpg)
What is category theory?It was invented in the early 1940’s
Category theory has been proposed as a new foundation for mathematics (to replace set theory)
A category has the ability to compose the arrows associatively
![Page 41: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/41.jpg)
Unified data model
• One unified data model with objects and morphisms
XMLRelation
RDF Graph
![Page 42: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/42.jpg)
Unified language model
Tabular
SPAQLXPath Xquery
SQL
• One unified language model with functors
![Page 43: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/43.jpg)
Transformation
• Natrual transformation between multiple language for multi-model data
![Page 44: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/44.jpg)
Ongoing research topics
▶ Approximate query processing based on quantum framework
▶▶ Multi-model data integration based on
category theory
44
![Page 45: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/45.jpg)
Conclusion
1. Parameter tuning is important for Big data platform like Hadoop and Spark.
2. Emerging two new theoretical foundations on multi-model databases: quantum framework and category theory
45
![Page 46: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/46.jpg)
Reference
▶ Jiaheng Lu, Irena Holubová: Multi-model Data Management: What's New and What's Next? EDBT 2017: 602-605
▶ Jiaheng Lu: Towards Benchmarking Multi-Model Databases. CIDR 2017
▶ Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
46
![Page 47: Variety with Big Data Volume and Research Works to CopeAnalysis in the Big Data Era Popular big data platform: Hadoop and Spark Burden on users: provision and tuning Effect of system-tuning](https://reader033.vdocument.in/reader033/viewer/2022042313/5ee04414ad6a402d666b7b63/html5/thumbnails/47.jpg)
47