hadoop-ds: which sql-on-hadoop rules the herd
© 2014 IBM Corporation
Information Management
Evaluating SQL-on-Hadoop
Performance and Compatibility
IBM Big SQL Hadoop-DS Benchmark Last revised: Oct 26, 2014
Agenda
About Big SQL
The TPC-DS™ Benchmark
The Hadoop-DS Benchmark
Big SQL performance
30TB Hadoop-DS result with IBM Big SQL
10TB Hadoop-DS comparison with Cloudera Impala™ and
Hortonworks® Hive
Conclusions
Additional detail
TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
Hortonworks is a trademark of Hortonworks Inc.
Other company, product, or service names may be trademarks or service marks of others.
The case for SQL on Hadoop
SQL has become ubiquitous in today’s data center
Customers have large existing investments
Skills, commercial & in-house applications
70% of Big Data initiatives involve transactional data [1]
Transactional big data well suited to SQL
Standardization & compatibility are essential
Customers modernizing warehouse environments cannot afford
separate SQL dialects and tools for different data sources
1. 70% of 465 survey respondents cite transactional data as a primary target for big data initiatives. Source: Gartner research note “Survey Analysis - Big Data Adoption in 2013 Shows Substance Behind the Hype”, Sept 12 2013. Analyst(s): Lisa Kart, Nick Heudecker, Frank Buytendijk
IBM InfoSphere BigInsights - Big SQL
Big SQL = Big Investment Protection
Rich ANSI SQL support
Outstanding performance
Native Hadoop data sources
Federation: multiple data sources
Extensive analytic functions
Security built-in
Native Hadoop Data Sources
CSV SEQ Parquet RC
AVRO ORC JSON Custom
Optimized SQL MPP Run-time
Big SQL
SQL based
Application
IBM invented SQL and has over thirty
years of experience engineering
advanced SQL query engines
IBM InfoSphere BigInsights - Big SQL
Application Portability &
Integration
Native Hadoop Data
Comprehensive file formats
Performance
Powerful query re-writer
Cost-based optimizer
Sophisticated memory mgmt.
Federation
Single SQL statement
Multiple data sources
DB2, Oracle, Teradata & more
Enterprise Features
Security & Auditing
Self-tuning & management
Comprehensive monitoring
Rich SQL Support
ANSI Compliant
IBM PL SQL Compatible
Extensive Analytic Functions
Big SQL = Big Investment Protection
About the TPC-DS Benchmark – www.tpc.org
Models a hypothetical retail operation
Realistic multi-domain data warehouse environment
Retail sales, web, catalog data, inventory, demographics & promotions
Models several aspects of business operations
Queries, concurrency, data loading, data maintenance
Designed for relational data warehouse product offerings
Four broad types of queries:
Reporting queries, Ad-hoc queries, Iterative OLAP queries, Data mining
queries – 99 queries in all
Designed for multiple scale factors
100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
Designed for multi-user concurrency
Minimum of 4 concurrent users running all 99 queries
As of October 2014, no vendor has published a formal TPC-DS benchmark result
Beware of Cherry Pickers & Benchmarketing!
TPC-DS has strict requirements
All 99 queries need to be run in their entirety
Each query is unique and tests a different
facet of the environment
Answer set correctness must be proven
Result must be audited
As a result, it is not valid to:
Select individual queries
Change queries outside of prescribed
guidelines (“minor query modifications”)
Alter the database schema
Configure the system on a per-query basis
Alter the system between single-user and
multi-user tests
About the Hadoop-DS Benchmark
Created by IBM
The Big Data Decision Support Benchmark (Hadoop-DS) is inspired
by, and highly compliant with, TPC-DS
Fully complies with the TPC-DS schema requirement
Uses all 99 queries
Meets the multi-user requirement
Has been audited by an approved TPC-DS auditor but as a non-TPC
benchmark
There are deviations from TPC-DS
No data maintenance operations, referential integrity enforcement, or ACID
property validation as these are not feasible with HDFS
Additional statistics used (advanced Big SQL capability)
Different metric (to avoid confusion with TPC-DS)
No price and price/performance measures included
Not an official TPC benchmark result
What are the key Hadoop-DS metrics?
Primary metrics:
Qph Hadoop-DS@SF (Single User)
• Single-user queries per hour at a particular scale factor
Qph Hadoop-DS@SF (Multi User)
• Multi-user queries per hour at a particular scale factor
Two distinct measures
Power run – refers to a single stream of queries running in sequence
Throughput run – refers to multiple streams of queries executing
concurrently. A minimum of four concurrent streams is required
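The difference between the two measures can be sketched as a small test harness. This is only an illustration under stated assumptions: `run_query` is a hypothetical stand-in for submitting a query to the engine under test, threads stand in for concurrent users, and the deck does not describe how IBM's actual driver was built.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MIN_THROUGHPUT_STREAMS = 4  # minimum stated on the slide


def run_query(query_id):
    """Hypothetical stand-in for submitting one query to the engine under test."""
    time.sleep(0.001)  # placeholder for real query execution
    return query_id


def run_stream(query_ids):
    """Run one stream: every query in sequence, one after another."""
    return [run_query(q) for q in query_ids]


def power_run(query_ids):
    """Single stream of queries; returns (elapsed seconds, queries completed)."""
    start = time.perf_counter()
    done = run_stream(query_ids)
    return time.perf_counter() - start, len(done)


def throughput_run(query_ids, streams=MIN_THROUGHPUT_STREAMS):
    """Several streams executing concurrently; returns (elapsed seconds, queries completed)."""
    if streams < MIN_THROUGHPUT_STREAMS:
        raise ValueError("a throughput run needs at least four concurrent streams")
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        results = list(pool.map(run_stream, [query_ids] * streams))
    elapsed = time.perf_counter() - start
    return elapsed, sum(len(r) for r in results)


queries = list(range(1, 100))  # query numbers 1..99
power_elapsed, power_count = power_run(queries)
tput_elapsed, tput_count = throughput_run(queries)
```

With 99 queries, the power run completes 99 queries and a four-stream throughput run completes 396; the elapsed wall-clock time of each run is what feeds the queries-per-hour metrics.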
What did IBM Test?
30TB Hadoop-DS benchmark with Big SQL
Executed and audited in as compliant a manner as possible
• Demonstrate the robustness of Big SQL at scale
• Demonstrate Big SQL’s ability to run all queries
• Demonstrate Big SQL’s multi-user concurrency capability
Letter of attestation from the auditing firm and accompanying
benchmark report.
10TB subset Hadoop-DS benchmark with 3 vendors
Compare Big SQL, Cloudera Impala and Hortonworks Hive
Use the subset of queries all three vendors are able to execute
Use an identical 17 node cluster for each vendor
Auditor reviewed method, procedures and measurement results
Two main benchmarks were executed
Benchmark Environment
Management Node
One x3650 M4 BD Two E5-2680 v2 2.8GHz 10-core
128GB RAM, 1866MHz
2TB 3.5” HDD
Dual-port 10GbE
RHEL 6.4
EXT4/HDFS/Parquet/ORC
Data Nodes
Seventeen x3650 M4 BD Two E5-2680 v2 2.8GHz 10-core
128GB RAM, 1866MHz
Ten 2TB 3.5” HDD
Four 120GB 3.5” SSD
Dual-port 10GbE
RHEL 6.4
EXT4/HDFS/Parquet/ORC
Three identical clusters deployed, one for each distribution
Note: Big SQL and Impala used the Parquet file format. Hive used ORC
IBM Big SQL runs 100% of the queries
IBM Big SQL runs all 99 queries, 12 with allowable minor modifications
Impala runs only 52 queries – 35 out-of-the-box and 17 with allowable minor modifications
Hive runs 58 queries – 32 out-of-the-box and 26 with allowable minor modifications
[Figure: one chart each for Big SQL 3.0, Impala 1.4.1 and Hive 0.13. Legend: Working directly from template; Compliant query modifications; Non-compliant query re-write; Not working or no re-write]
Query compliance by SQL-on-Hadoop offering
IBM is the only vendor able to run all 99 Hadoop-DS queries with minor modifications allowable under TPC-DS benchmark rules
[Figure: percentage of queries (0% to 100%) in each category for Big SQL 3.0, Impala 1.4.1 and Hive 0.13. Categories: Working directly from template; Compliant query modifications; Non-compliant query re-write; Not working or no re-write]
Hadoop DS – Query Compliance Detail
Number of queries                 Small-scale test (1 GB)                  10TB scale test
                                  Big SQL 3.0  Impala 1.4.1  Hive 0.13    Big SQL 3.0  Impala 1.4.1  Hive 0.13
Original query unchanged          87           35            32           87           31            27
Minor query modifications         12           17            26           12           11            29
Major query re-write              0            36            32           0            30            13
Percentage of queries that run    100%         89%           91%          100%         73%           70%
Not working or no re-write found  0            11            9            0            27            30
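The percentage row can be reproduced from the counts in the table. The sketch below assumes that "queries that run" means unchanged, plus minor modifications, plus major re-writes, out of 99; the function and variable names are illustrative and not taken from the benchmark kit.

```python
TOTAL_QUERIES = 99

# (unchanged, minor modifications, major re-write, not working), copied from the table
compliance = {
    ("1GB", "Big SQL 3.0"): (87, 12, 0, 0),
    ("1GB", "Impala 1.4.1"): (35, 17, 36, 11),
    ("1GB", "Hive 0.13"): (32, 26, 32, 9),
    ("10TB", "Big SQL 3.0"): (87, 12, 0, 0),
    ("10TB", "Impala 1.4.1"): (31, 11, 30, 27),
    ("10TB", "Hive 0.13"): (27, 29, 13, 30),
}


def percent_running(unchanged, minor, major, not_working):
    """Share of the 99 queries that run in any form, rounded to a whole percent."""
    assert unchanged + minor + major + not_working == TOTAL_QUERIES
    return round(100 * (unchanged + minor + major) / TOTAL_QUERIES)


def compliant_count(unchanged, minor, major, not_working):
    """Queries that run unchanged or with allowable minor modifications only."""
    return unchanged + minor


percentages = {key: percent_running(*row) for key, row in compliance.items()}
compliant = {key: compliant_count(*row) for key, row in compliance.items()}
```

Under that assumption the computed values match the percentage row, and the compliant counts at 1 GB (52 for Impala, 58 for Hive) match the totals quoted on the earlier slide.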
IBM Big SQL – Runs 100% of the queries
Key points
In competitive environments, many queries needed to be re-written, some significantly
Owing to various restrictions, some queries could not be re-written or failed at run-time
Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real databases in production is another
Competitive environments require significant effort at scale
Results for 10TB scale shown here
Hadoop-DS – Performance results
                  # Queries  Elapsed time (s)       Qph-HDS@SF             Big SQL advantage
                             Power     Throughput   Power     Throughput   Power   Throughput
46 common queries @ 10TB
IBM Big SQL 3.0   46         2,908     6,945        5,694     9,537
Impala 1.4.1      46         10,536    14,920       1,571     4,439        3.6     2.1
Hive 0.13         46         15,949    59,550       1,038     1,112        5.4     8.5
All 99 queries @ 10TB
IBM Big SQL 3.0   99         32,361    88,764       1,101     2,409
Impala 1.4.1      99         Not possible
Hive 0.13         99         Not possible
All 99 queries @ 30TB
IBM Big SQL 3.0   99         104,445   187,993      1,023     2,274
Impala 1.4.1      99         Not possible
Hive 0.13         99         Not possible
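The deck does not state how the Qph-HDS figures are derived from elapsed time. The sketch below is an inference, not the official definition: it assumes the metric is queries completed per hour, multiplied by 10 per TB of scale factor and truncated to a whole number. The function name `qph_hds` and its parameters are hypothetical. That assumption reproduces the rows for which the deck states the stream count (the 46-query subset with 4 streams, and the 30TB run with four streams).

```python
import math


def qph_hds(num_queries, streams, elapsed_seconds, scale_tb):
    """Inferred formula (an assumption): scaled queries per hour, truncated."""
    queries_per_hour = num_queries * streams * 3600 / elapsed_seconds
    return math.floor(10 * scale_tb * queries_per_hour)


# (label, queries, streams, elapsed seconds, scale in TB, value reported in the table)
reported = [
    ("Big SQL power, 46 queries @10TB", 46, 1, 2908, 10, 5694),
    ("Big SQL throughput, 46 queries @10TB", 46, 4, 6945, 10, 9537),
    ("Impala power, 46 queries @10TB", 46, 1, 10536, 10, 1571),
    ("Impala throughput, 46 queries @10TB", 46, 4, 14920, 10, 4439),
    ("Hive power, 46 queries @10TB", 46, 1, 15949, 10, 1038),
    ("Hive throughput, 46 queries @10TB", 46, 4, 59550, 10, 1112),
    ("Big SQL power, 99 queries @30TB", 99, 1, 104445, 30, 1023),
    ("Big SQL throughput, 99 queries @30TB", 99, 4, 187993, 30, 2274),
]

# Collect any rows where the inferred formula disagrees with the table
mismatches = [
    label
    for label, n, s, t, tb, value in reported
    if qph_hds(n, s, t, tb) != value
]
```

An empty `mismatches` list only shows that the assumed formula is consistent with these rows; it does not confirm that this is how the audited metric was defined.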
IBM Big SQL – Leading performance: up to 5.4x faster
Power run (single-stream), elapsed time, as measured across the subset of 46 queries that Impala and Hive can both run
[Figure: elapsed time in seconds, axis 0 to 18,000, for Big SQL, Impala and Hive]
Big SQL 48:28
Impala 2:55:36
Hive 4:25:49
3.6x faster than Impala, 5.4x faster than Hive
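The clock times on this slide and the "times faster" figures can be cross-checked with a few lines of arithmetic. The helper name `to_seconds` is illustrative; the three clock values are the power-run labels from the chart.

```python
def to_seconds(clock):
    """Convert 'mm:ss' or 'h:mm:ss' to a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Power-run elapsed times as labelled on the chart
power = {"Big SQL": "48:28", "Impala": "2:55:36", "Hive": "4:25:49"}

seconds = {name: to_seconds(clock) for name, clock in power.items()}

# Elapsed time of each engine relative to Big SQL
speedup = {name: seconds[name] / seconds["Big SQL"] for name in seconds}
```

The converted values agree with the elapsed seconds in the results table (2,908, 10,536 and 15,949). The ratios come out at roughly 3.62 and 5.48, which the slide reports as 3.6x and 5.4x.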
IBM Big SQL – Leading performance: up to 8.5x faster
Throughput run - 4 streams, average elapsed time, as measured across the subset of 46 queries that Impala and Hive can both run
[Figure: elapsed time in seconds, axis 0 to 70,000, for Big SQL, Impala and Hive]
Big SQL 1:55:45
Impala 4:08:40
Hive 16:32:30
2.1x faster than Impala, 8.5x faster than Hive
30TB Hadoop-DS Results
Because other distributions could not run the 99 required
queries, it was only possible to obtain a result for Big SQL
IBM had hoped to obtain partial results @ 30TB (comparing
queries that would run across distributions)
Testing convinced us that the number of queries that
competitors could run @ 30TB was sufficiently small that a
detailed comparison would not be valid
Big SQL – Scalability and Throughput
Four concurrent query streams @30TB in 1.8x time of a single stream
[Figure: Elapsed times for Big SQL Hadoop-DS @30TB, single and 4 streams, 99 queries. Y axis: Elapsed Time (secs), 0 to 200,000. Bars: Power Run (99 queries) and Throughput Run (396 queries)]
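The 1.8x claim follows from the 30TB elapsed times in the results table. A short worked calculation, with illustrative variable names, also shows what that implies for throughput.

```python
# 30TB elapsed times for Big SQL, taken from the results table
power_seconds = 104_445       # single stream, 99 queries
throughput_seconds = 187_993  # four streams, 4 x 99 = 396 queries

time_ratio = throughput_seconds / power_seconds  # how much longer the 4-stream run took
work_ratio = 396 / 99                            # how much more work it completed
throughput_gain = work_ratio / time_ratio        # queries per unit time, relative to one stream
```

Four times the work in about 1.8 times the elapsed time means roughly 2.2 times as many queries completed per hour, which is in line with the ratio of the reported 30TB metrics (2,274 versus 1,023).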
Audited Results
Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale
InfoSizing, Transaction Processing Performance Council Certified Auditors, verified both IBM results as well as results on Cloudera Impala and Hortonworks Hive
These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented
Conclusions
Big SQL is the only SQL-on-Hadoop engine able to run a
full Hadoop-DS workload
Complete schema
All 99 queries
Multi-user test
Ran at both 10TB and 30TB data volumes
Together this test makes for a good predictor of compatibility with
real applications
IBM Big SQL is the best performing solution by a large
margin on the 46-query common set at 10TB
~ 3.6 times faster than Cloudera Impala (single-stream power run)
~ 5.4 times faster than Hortonworks Hive (single-stream power run)
Thank you!
Additional Slides
Query throughput for Hadoop-DS @ 10TB
4 concurrent streams and 99 queries
[Figure: effective query throughput (Qph-HDS@10TB), axis 0 to 3,000, for Big SQL, Impala and Hive 0.13]
Impala: 99 queries could not be run
Hive 0.13: 99 queries could not be run
Query throughput for Hadoop-DS @ 30TB
6 concurrent streams and 99 queries
[Figure: effective query throughput (Qph-HDS@30TB), axis 0 to 2,500, for Big SQL, Impala and Hive 0.13]
Impala: 99 queries could not be run
Hive 0.13: 99 queries could not be run
The Common Query Set
While Big SQL ran all queries, many of the Hadoop-DS queries would not run on Impala or Hive
On both platforms, some additional queries could be made to run by re-writing the queries (something that is not permitted in the TPC-DS benchmark specification)
At 10TB scale, several queries failed at run-time
These 46 queries are the common set that ran at 10TB scale and could thus be compared
The testing team deliberately included some queries with non-compliant query modifications where the changes were judged to be minor in order to have a reasonable number of queries to compare
46 queries could be run on Big SQL, Impala and Hive at 10TB
Queries shown in blue are part of the common set
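Selecting a common query set amounts to intersecting the sets of queries that completed on each engine. The sketch below uses made-up query numbers purely for illustration, since the transcript does not preserve which queries ran on which engine.

```python
# Hypothetical query-number sets; the real per-engine lists are not in this transcript
ran_on = {
    "Big SQL": set(range(1, 100)),  # all 99 queries
    "Impala": {1, 3, 7, 19, 42},
    "Hive": {3, 7, 12, 19, 55},
}

# A query belongs to the common set only if every engine ran it
common = set.intersection(*ran_on.values())
```

With the real per-engine lists at 10TB, the same intersection would yield the 46 queries used for the comparison.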
About the TPC-DS queries
The queries are diverse, and many are complex
Reflecting real business needs – a random sample:
Find customers returning items more frequently than normal (q1)
States with customers most amenable to premium priced offers (q6)
List key metrics for unadvertised in-store promotions by demographic (q7)
Identify similar customers purchasing through multiple sales outlets (q10)
Find customers shifting purchasing habits to the web (q11)
Key measures for catalog sales fulfilled from an alternate warehouse (q16)
Find frequently sold items and the circumstances under which repeat sales take place (q23)
Understand the products and retail locations where items are likely to be returned and subsequently re-purchased via the catalog (q29)
Display customers making significant local purchases compared to buying potential based on dependents and vehicles owned (q34)
Benchmark Environment
X3650BD Master host
X3650BD Data nodes #1 to #16
Networks: 10 GbE switch, 10 GbE private net, IBM Blue net, Mgmt net
Three identical clusters deployed, one for each distribution