towards selecting best combination of sql-on-hadoop systems …€¦ · •spark, tez, flink, yarn,...

34
IBM Research © 2016 IBM Corporation Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs Tatsuhiro Chiba , Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii IBM Research

Upload: others

Post on 28-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Tatsuhiro Chiba, Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii

IBM Research

Page 2: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Agenda§ Motivation, Problems and Challenges§ Backgrounds

– Backend engines: Spark and Tez– Backend runtimes: OpenJDK and J9

§ Empirical Study– Performance evaluation– Performance analysis

§ ML model– training classification model– evaluating classification model

§ Summary

2 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 3: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Distributed Processing Framework for Big Data

§ Hadoop Eco-Systems– HDFS: the center of data store– utilizing data between different frameworks• Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc…

§ Big Data Workload– ETL– SQL–ML / DL / Streaming

§ JVM as a Hadoop Runtime– disk-oriented à in-memory oriented– I/O intensive à CPU-intensive

3

Dist. Framework

Application

JVM

OS

Hardware

IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 4: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Motivation and Problem ‒ Many choices of the systems

§ Rapid Development Cycle– Fast open sources releases– marge new feature frequently– query performance is also improved

§ Too many SQL-on-Hadoop Systems– Which one is best? (SparkSQL or Hive or Impara or Presto or …)– Should we switch a system to another one?– No single SQL-on-Hadoop engine is best for ALL queries– No single JVM is best for ALL queries as well

4 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Performance Improvement History

0

50

100

150

1.4.1 1.5.2 1.6.1 1.4.1 1.5.2 1.6.1

TPC-H Q3 TPC-H Q5

exec

utio

n tim

e (se

c)

Page 5: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Motivation and Problem ‒ Selecting a system adaptively§ Requirements of Query Execution on Cloud

– query users: do not care about backend system as long as it returns a result fast– cloud providers: wants to minimize resources by using fast processing backend

§ Related work: workload translation– generate suitable code for a best system–Musketeer [Eurosys ’16], Weld [CIDR ’17]

§ Related work: Multi Store / Hybrid Engines–MISO [SIGMOD ’14], MuSQLE [BigData ’16]– using multiple engines/stores based on cost model / heuristics / etc.

5 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

No JVM awarenessneed to update cost model / heuristics frequently

Page 6: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Questions and Challenges

6 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Run SQL on best

engine/runtimeML ModelEmpirical Study

GOAL

1. What about potential gains?

2. What features make the differences?

3. What data to help building a model?

4. How accurately can the model predict?

observation

reasoning

training

predicting

Meta Scheduler

Choices of Engine and Runtime

Page 7: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Agenda§ Motivation, Problems and Challenges§ Backgrounds

– Backend engines: Spark and Tez– Backend runtimes: OpenJDK and J9

§ Empirical Study– Performance evaluation– Performance analysis

§ ML model– training classification model– evaluating classification model

§ Summary

7 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 8: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Spark/Spark SQL§ Spark

– DAG-based distributed framework– execute stage by stage

§ Spark SQL– Catalyst ‒ Query Optimizer– Parquet Columnar Format– code generation (SIMD, loop unrolling)

8 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs��� Michael et al., Spark SQL: Relational Data Processing in Spark , SIGMOD’15

��� https://spark.apache.org/docs/latest/cluster-overview.html

Catalyst

Page 9: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Tez/Hive

§ Tez– Generalized Map Reduce– DAG-based distributed framework

§ Hive/LLAP– focus on interactive query– Vectorization / Pipeline– In-Memory Columnar Cache (off-heap)– ORC Columnar Format

9 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Ref�Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications, SIGMOD’15

Ref� https://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive, Hadoop Summit 2015

Page 10: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

JVM ‒ OpenJDK & IBM J9

§ JVM– OpenJDK / J9 (Eclipse OMR based)– internal optimization / implementation are different

§ JIT– Tiered Compilation Level– Intrinsics– Inlining Heuristics– Vectorization Code

§ Memory Management– GC Algorithm (G1GC / Generational / CMS / Parallel / Copying etc.)–Memory Fence

§ Thread– Lock Reservation

10 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 11: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Agenda§ Motivation, Problems and Challenges§ Backgrounds

– Backend engines: Spark and Tez– Backend runtimes: OpenJDK and J9

§ Empirical Study– Performance evaluation– Performance analysis

§ ML model– training classification model– evaluating classification model

§ Summary

11 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 12: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Questions and Challenges

12 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Run SQL on best

engine/runtimeML ModelEmpirical Study

GOAL

1. What about potential gains?

2. What features make the differences?

3. What data to help building a model?

4. How accurately can the model predict?

observation

reasoning

training

predicting

Meta Scheduler

Choices of Engine and Runtime

Page 13: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Environment ‒ HW/SW Spec & Benchmark

§ Machine– evaluated on a single POWER8 node– Use Flash storage for HDFS

§ TPC-DS Benchmark– hive-testbench (*1)– 68 queries

§ data set– Scale Factor 500 (500GB)– prepared two columnar dataset; Parquet & ORC

13 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Software version

Spark 2.1.0

Hadoop (HDFS) 2.7.2

Tez 0.9.0

Hive 2.2.0

OpenJDK 1.8.0_u121

IBM J9 JVM 1.8.0 SR4FP2

Machine Description

Processor POWER8

3.3 GHz * 2

# Cores 24 cores

(2 Sockets * 12 Cores)

SMT 8

Memory 1TB

Disk Flash System (9.3TB)

OS Ubuntu 16.04(kernel 4.4.0-31)

(*1) https://github.com/hortonworks/hive-testbench

Page 14: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Environment - Others§ Configurations of Spark & Tez

§ Evaluation Methodology – used Thrift Server– picked up fastest result in 5-times test per query– reset buffer cache (echo 3 > /proc/sys/vm/drop_caches)

14 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Configuration Spark / Spark SQL Tez / Hive

Executor JVM 1 1

Worker Threads 12 12

I/O Threads - 12

On Heap Size 192 GB 96 GB

Off Heap Size - 96 GB

Execution Mode Daemon (Thrift Server) LLAP Daemon

Columnar Format Parquet ORC

Compression Format gzip (zlib) gzip (zlib)

Other JVM Options (Common) GC Threads = 12, -agentpath:libjvmti_oprofile.so

Page 15: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

0

1

2

0

500

1000

1500

2000

Q72

Q15

Q12

Q55

Q63

Q17

Q52

Q13

Q93

Q98

Q46

Q34

Q42

Q68

Q32

Q50

Q60

Q18

Q75

Q31

Q84

Q83

Q22

Q56

Q89

Q26

Q39

Q67

Q94

Q47

Q95

ratio

(OpenJDK

/IBMJ9

)

Exectutio

nTime(se

c.) IBMJ9 OpenJDK ratio

Performance Comparison of TPC-DS on Spark- Which JVM is better for Spark?§ Performance Comparison Result

– OpenJDK achieved faster than J9 in 35 queries (35/62 = 56.5%)– J9 achieved faster than OpenJDK in 27 queries (27/62 = 43.5%)– leads up to 3x drawback

15

OpenJDK is faster

IBM J9 is faster

Lower is Better

IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 16: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

0

1

2

0

500

1000

1500

2000

Q87

Q80

Q92

Q24

Q22

Q95

Q94

Q85

Q90

Q63

Q89

Q83

Q49

Q96

Q71

Q39

Q98

Q34

Q25

Q55

Q75

Q31

Q19

Q20

Q64 Q3

Q45

Q52 Q7

Q76

Q28

Q70

Q84

ratio

(OpenJDK

/IBMJ9

)

Exectutio

nTime(se

c.) IBMJ9 OpenJDK ratio

Performance Comparison of TPC-DS on Tez- Which JVM is better for Tez?

16

OpenJDK is faster

IBM J9 is faster

Lower is Better

IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

§ Performance Comparison Result– OpenJDK achieved faster than J9 in 35 queries (35/65 = 53.8%)– J9 achieved faster than OpenJDK in 30 queries (30/65 = 46.1%)– leads up to 2x drawback

Page 17: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

0.01

0.1

1

10

100

0

500

1000

1500

2000

Q66

Q58

Q17

Q93

Q49

Q71

Q82

Q13 Q6…

Q64

Q15

Q32

Q55

Q42 Q7

Q79

Q98 Q3

Q94

Q75

Q34

Q76

Q87

Q91

Q43

Q95

Q97

Q39

Q51

Q65

ratio

(Tez/Spark)

Exectutio

nTime(se

c.) Spark Tez Ratio(tez/Spark)

Tez is faster

Spark is faster

Performance Comparison of TPC-DS with OpenJDK- Which query engine is better with OpenJDK?

17

Lower is Better (Using OpenJDK)

IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

§ Performance Comparison Result– Tez is faster in two-thirds queries than Spark– leads up to 17x drawback

Page 18: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Summary of Motivational Evaluation

§ Result– 60 queries are successfully run– picked up a best combination for all queries

§ Tendency– Tez is better than Spark– J9 is better than OpenJDK– Combination of Tez & J9 is good at in many cases

18 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 19: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Comparison of picked up queries

§ System– Spark wins Tez : Q51– Tez wins Spark : Q50, Q58, Q82

§ Runtime– J9 wins OpenJDK: Q51, Q58– OpenJDK wins J9: Q50, Q82

§ analysis– query plan (DAG) / middleware execution stats– hot method profiling (oprofile) / system utilization– Java method stack trace / GC Log / JIT Log

19 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Low

er is

Bet

ter

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

Page 20: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

Gain comes from JVM difference‒ Spark case§ Q51

– J9 wins–many stages– less shuffle data– gets 2.6x gain in shuffle stage

20 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

§ Q82– OpenJDK wins– few stages–much shuffle data– gets 1.4x gain in map stage

Query # Map Stages # Reduce Stages Input Read Shuffle Output Difference

Q51 2 6 6.0 GB 1.0 GB ShuffleJ9: 11s OpenJDK: 29 s

Q82 2 2 2.5 GB 5.6 GB MapJ9: 66s OpenJDK: 47s

Page 21: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Gain comes from JVM difference‒ Spark case

21 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

§ Method Profiling– J9 is good at Intrinsic for Sun.misc.Unsafe.copyMemory (JNI overhead)– OpenJDK is good at serialization and sort in data shuffling

J9 is faster OpenJDK is faster

J9 Advantage: Many Stages, less Shuffling DataOpenJDK Advantage: Few Stages, much Shuffling Data

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

Page 22: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Gain comes from JVM difference‒ Tez case§ Q50

– OpenJDK wins – gets 1.7x gain in reduce vertex

22 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

§ Q51– J9 wins– gets 3x gain in map vertex

Query # Map Stages # Reduce Stages Input Records / GB Shuffle Records / GB Difference

Q50 5 7 1.3 * 10^9 (5.4 GB)

5.0 * 10^7(1.9 GB)

ReduceJ9: 106s, OpenJDK: 60s

Q51 4 5 3.5 * 10^8(1.3 GB)

3.5 * 10^8(3.5 GB)

Map J9: 7s, OpenJDK: 21s

Difference

Difference

Difference

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

Page 23: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Gain comes from JVM difference‒ Tez case§ Q51

– J9 achieved 3x gain in map vertex– writing intermediate data (including in-mem agg. & SerDe) is time-consuming

23 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

In memory ORC (LLAP) Read ���

java.io.DataOutputStream ���

JOIN / Aggregation ���Serialization/Deserialization ���

PipelinedSorter (Reduce Vertex) ���

Serialize + Spill

J9 Advantage: Few Vertices, Much Shuffling Data

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

Page 24: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

0246810121416

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140

usr sys wai

Gain comes from JVM difference‒ Tez case§ Q50

– OpenJDK achieved 1.7x gain in reduce vertex– many shuffle threads / many vertices– huge context switch overhead

24 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

OpenJDK

J9

ReduceMap

OpenJDK Advantage: Many Vertices, Less Shuffling Data

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

0246810121416

0 10 20 30 40 50 60 70 80 90 100 110

usr sys wai

Page 25: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

M 1 M 2

R 1 R 2

R 3 R 4

R 5

R 6

websales

storesales

datedim

M 3 M 4

R 1 R 2

R 5

R 6

M 1 M 2

R 7

storesales

websales

datedim

datedim

Spark DAG Tez DAG

Gain comes from query engine difference‒ Spark or Tez§ Spark Advantage (Q51)

– reduce shuffling data by better filtering rule– Spark: 366 rows à shuffling 687MB– Tez: 8,116 rows à shuffling 3.6GB

25 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

366 rows8116 rows

687MB 3.6GB

Good query optimizer (Cost Based Optimizer) helps to reduce shuffling data

§ Tez Advantage (Q50, Q58, Q82)– reduce shuffling data by Bloom Filter

0

100

200

300

400

500

Q50 Q51 Q58 Q82

Spark (J9) Spark (OpenJDK) Tez (J9) Tez (OpenJDK)

Page 26: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Empirical Study Summary - What features affect the performance

§ Query Engine– DAG•# of Vertices / Stages (Map or Reduce)• amount of shuffling data (intermediate data)• input data size (tables)

– CBO• filtering rule

§ JVM– # of threads– Intrinsic– SerDe performance– I/O performance

26 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 27: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Agenda§ Motivation, Problems and Challenges§ Backgrounds

– Backend engines: Spark and Tez– Backend runtimes: OpenJDK and J9

§ Empirical Study– Performance evaluation– Performance analysis

§ ML model– training classification model– evaluating classification model

§ Summary

27 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Page 28: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Questions and Challenges

28 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Run SQL on best

engine/runtimeML ModelEmpirical Study

GOAL

1. What about potential gains?

2. What features make the differences?

3. What data to help building a model?

4. How accurately can the model predict?

observation

reasoning

training

predicting

Meta Scheduler

Choices of Engine and Runtime

Page 29: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Proposed Classifier Overview - Training and Prediction§ Key points

–Making classifier model based on the features that come from DAG– Selecting a combination of the system based on the model before query execution

29 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

+

+

9

�����

����� �������

�������

+ 9( )

+ 9( )

(

(

������� ��������

Page 30: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Training Classifier§ Why extract features from DAG? Why not SQL?

– contains much more info including table stats/actual stages than SQL§ What features are used

– # of stages, # of joins, join types, used tables, etc. – 69 features in total

30 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

)

( (

��

��

��� ����� �

��� ����� �

����� �� �����

���� � ���

Page 31: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Predicting best combination using classifier

§ Extract features without actual query run– sql explain generates DAG (compiling it in 2-5 sec)

§ Predict best system for the query– decide a combination based on the classifier

31 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

+

+

9

9

9

+ 9( )

(

+ 9( )

(

Page 32: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Evaluation of classifier§ Training and testing four ML algorithms

– kNN, Decision Tree, SVM, Random Forest– k-fold cross-validation (split data into 80:20)

§ Models– binary class: Spark or Tez–multi class: Spark/OpenJDK or Spark/J9 or Tez/OpenJDK or Tez/J9

32 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

Accuracy

Features Impact in Random Forest

Random Forest is better than others

- # of stages makes impact to the model- Using BF or Join types do not affect

Page 33: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

0

2000

4000

6000

8000

10000

12000

Q21

Q91

Q96

Q3 Q55

Q43

Q73

Q19

Q7 Q66

Q15

Q60

Q71

Q13

Q54

Q97

Q75

Q58

Q93

Q24

Q64

Accu

mul

ated

Exe

c Tim

e (se

c) baselineideal(2)predict(2)ideal(4)predict(4)

Evaluation of classifier§ training and testing model

– k-fold cross-validation except test query feature§ Result

– baseline: exec time with Spark/J9 only– ascending order

33 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

reduced by 50%

reduced by 35%

big miss prediction

Page 34: Towards Selecting Best Combination of SQL-on-Hadoop Systems …€¦ · •Spark, Tez, Flink, YARN, MR, Hive, Pig, Hbase, etc… § Big Data Workload – ETL – SQL – ML / DL

IBM Research

© 2016 IBM C orporation

Summary and Future Works

§ Summary– No single query engine and JVM is best for all queries– query engine mismatch leads up to 10x drawback– JVM mismatch also leads up to 3x drawback– Proposed Random Forest based classifier achieved 50% time reduction in total

§ Future Works– implements meta scheduler– applies it on Cloud/Container/Kubernetes environment– training data augmentation

34 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs