hadoop-ds: which sql-on-hadoop rules the herd

© 2014 IBM Corporation

Information Management

Evaluating SQL-on-Hadoop

Performance and Compatibility

IBM Big SQL Hadoop-DS Benchmark

Last revised: Oct 26, 2014



Agenda

About Big SQL

The TPC-DS™ Benchmark

The Hadoop-DS Benchmark

Big SQL performance

30TB Hadoop-DS result with IBM Big SQL

10TB Hadoop-DS comparison with Cloudera Impala™ and Hortonworks® Hive

Conclusions

Additional detail

TPC Benchmark, TPC-DS, and QphDS are trademarks of the Transaction Processing Performance Council.

Cloudera, the Cloudera logo, and Cloudera Impala are trademarks of Cloudera.

Hortonworks is a trademark of Hortonworks Inc.

Other company, product, or service names may be trademarks or service marks of others.



The case for SQL on Hadoop

SQL has become ubiquitous in today’s data center

Customers have large existing investments

Skills, commercial & in-house applications

70% of Big Data initiatives involve transactional data [1]

Transactional big data is well suited to SQL

Standardization & compatibility are essential

Customers modernizing warehouse environments cannot afford separate SQL dialects and tools for different data sources

[1] 70% of 465 survey respondents cite transactional data as a primary target for big data initiatives. Source: Gartner research note "Survey Analysis - Big Data Adoption in 2013 Shows Substance Behind the Hype", Sept 12 2013. Analyst(s): Lisa Kart, Nick Heudecker, Frank Buytendijk



IBM InfoSphere BigInsights - Big SQL

Big SQL = Big Investment Protection

Rich ANSI SQL support

Outstanding performance

Native Hadoop data sources

Federation: multiple data sources

Extensive analytic functions

Security built-in

[Diagram: a SQL-based application sits on top of Big SQL, whose optimized SQL MPP run-time reads native Hadoop data sources: CSV, SEQ, Parquet, RC, AVRO, ORC, JSON, Custom]

IBM invented SQL and has over thirty years of experience engineering advanced SQL query engines



IBM InfoSphere BigInsights - Big SQL

Application Portability & Integration

Native Hadoop Data

Comprehensive file formats

Performance

Powerful query re-writer

Cost-based optimizer

Sophisticated memory mgmt.

Federation

Single SQL statement

Multiple data sources

DB2, Oracle, Teradata & more

Enterprise Features

Security & Auditing

Self-tuning & management

Comprehensive monitoring

Rich SQL Support

ANSI Compliant

IBM PL SQL Compatible

Extensive Analytic Functions

Big SQL = Big Investment Protection



About the TPC-DS Benchmark – www.tpc.org

Models a hypothetical retail operation

Realistic multi-domain data warehouse environment

Retail sales, web, catalog data, inventory, demographics & promotions

Models several aspects of business operations

Queries, concurrency, data loading, data maintenance

Designed for relational data warehouse product offerings

Four broad types of queries:

Reporting queries, ad-hoc queries, iterative OLAP queries, data mining queries – 99 queries in all

Designed for multiple scale factors

100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB

Designed for multi-user concurrency

Minimum of 4 concurrent users running all 99 queries

No vendor has ever published a formal TPC-DS benchmark




Beware of Cherry Pickers & Benchmarketing!

TPC-DS has strict requirements

All 99 queries need to be run in their entirety

Each query is unique and tests a different facet of the environment

Answer set correctness must be proven

Result must be audited

As a result, it is not valid to:

Select individual queries

Change queries outside of prescribed guidelines (“minor query modifications”)

Alter the database schema

Configure the system on a per-query basis

Alter the system between single-user and multi-user tests



About the Hadoop-DS Benchmark

Created by IBM

The Big Data Decision Support Benchmark (Hadoop-DS) is inspired by, and highly compliant with, TPC-DS

Fully complies with the TPC-DS schema requirement

Uses all 99 queries

Meets the multi-user requirement

Has been audited by an approved TPC-DS auditor, but as a non-TPC benchmark

There are deviations from TPC-DS

No data maintenance operations, referential integrity enforcement, or ACID property validation, as these are not feasible with HDFS

Additional statistics used (advanced Big SQL capability)

Different metric (to avoid confusion with TPC-DS)

No price and price/performance measures included

Not an official TPC benchmark result




What are the key Hadoop-DS metrics?

Primary metrics:

Qph Hadoop-DS@SF (Single User)

• Single-user queries per hour at a particular scale factor

Qph Hadoop-DS@SF (Multi User)

• Multi-user queries per hour at a particular scale factor

Two distinct measures

Power run – refers to a single stream of queries running in sequence

Throughput run – refers to multiple streams of queries executing concurrently. A minimum of four concurrent streams is required
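
The difference between the two measures can be pictured with a small timing harness. This is only a sketch: `run_query` is a hypothetical stand-in for submitting one query to an engine, the query names are placeholders, and none of it is taken from the actual benchmark driver.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id):
    """Hypothetical stand-in: a real driver would submit the SQL to the engine."""
    return query_id


def run_stream(query_ids):
    """Run one stream: every query, one after another. Returns the number completed."""
    completed = 0
    for query_id in query_ids:
        run_query(query_id)
        completed += 1
    return completed


def power_run(query_ids):
    """Single stream in sequence. Returns (queries completed, elapsed seconds)."""
    start = time.perf_counter()
    completed = run_stream(query_ids)
    return completed, time.perf_counter() - start


def throughput_run(query_ids, streams=4):
    """Several streams concurrently. Returns (queries completed, elapsed seconds)."""
    if streams < 4:
        raise ValueError("a minimum of four concurrent streams is required")
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        completed = sum(pool.map(run_stream, [query_ids] * streams))
    return completed, time.perf_counter() - start


queries = [f"query{n:02d}" for n in range(1, 100)]  # placeholder names for 99 queries
power_count, power_seconds = power_run(queries)
throughput_count, throughput_seconds = throughput_run(queries, streams=4)
print(power_count, throughput_count)
```

With 99 queries per stream, the power run completes 99 queries and a four-stream throughput run completes 396; the elapsed time of each run is what feeds the queries-per-hour metric.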



What did IBM Test?

Two main benchmarks were executed

30TB Hadoop-DS benchmark with Big SQL

Executed and audited in as compliant a manner as possible

• Demonstrate the robustness of Big SQL at scale

• Demonstrate Big SQL’s ability to run all queries

• Demonstrate Big SQL’s multi-user concurrency capability

Letter of attestation from the auditing firm and accompanying benchmark report

10TB subset Hadoop-DS benchmark with 3 vendors

Compare Big SQL, Cloudera Impala and Hortonworks Hive

Use the subset of queries all three vendors are able to execute

Use an identical 17-node cluster for each vendor

Auditor reviewed method, procedures and measurement results



Benchmark Environment

                 Management Node                 Data Nodes
Servers          One x3650 M4 BD                 Seventeen x3650 M4 BD
CPU              Two E5-2680 v2 2.8GHz 10-core   Two E5-2680 2.8GHz 10-core
Memory           128GB RAM, 1866MHz              128GB RAM, 1866MHz
HDD              2TB 3.5” HDD                    Ten 2TB 3.5” HDD
SSD              (none listed)                   Four 120GB 3.5” SSD
Network          Dual-port 10GbE                 Dual-port 10GbE
OS               RHEL 6.4                        RHEL 6.4
Storage          EXT4/HDFS/Parquet/ORC           EXT4/HDFS/Parquet/ORC

Three identical clusters deployed, one for each distribution

Note: Big SQL and Impala used the Parquet file format. Hive used ORC.



IBM Big SQL runs 100% of the queries

[Charts: query compliance for Big SQL 3.0, Impala 1.4.1 and Hive 0.13. Legend: working directly from template; compliant query modifications; non-compliant query re-write; not working or no re-write]

IBM Big SQL runs all 99 queries, 12 with allowable minor modifications

Impala runs only 52 queries – 35 out-of-the-box and 17 with allowable minor modifications

Hive runs 58 queries – 32 out-of-the-box and 26 with allowable minor modifications







Query compliance by SQL-on-Hadoop offering

IBM is the only vendor able to run all 99 Hadoop-DS queries with minor modifications allowable under TPC-DS benchmark rules

[Chart: percentage of queries (0% to 100%) in each compliance category for Big SQL 3.0, Impala 1.4.1 and Hive 0.13]



Hadoop DS – Query Compliance Detail

                                  Small-scale test (1 GB)                   10TB scale test
Number of queries                 Big SQL 3.0   Impala 1.4.1  Hive 0.13     Big SQL 3.0   Impala 1.4.1  Hive 0.13
Original query unchanged          87            35            32            87            31            27
Minor query modifications         12            17            26            12            11            29
Major query re-write              0             36            32            0             30            13
Percentage of queries that run    100%          89%           91%           100%          73%           70%
Not working or no re-write found  0             11            9             0             27            30
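
The percentage and "not working" rows follow from the other three rows. The short sketch below recomputes them from the counts above, on the assumption that a query counts as running if it runs unchanged, with a minor modification, or with a major re-write; the function and variable names are illustrative.

```python
TOTAL_QUERIES = 99

# (original unchanged, minor modifications, major re-write) per engine, from the slide
counts = {
    "1 GB": {
        "Big SQL 3.0": (87, 12, 0),
        "Impala 1.4.1": (35, 17, 36),
        "Hive 0.13": (32, 26, 32),
    },
    "10 TB": {
        "Big SQL 3.0": (87, 12, 0),
        "Impala 1.4.1": (31, 11, 30),
        "Hive 0.13": (27, 29, 13),
    },
}


def summarize(unchanged, minor, major):
    """Return (queries that run, queries not working, percentage that run)."""
    running = unchanged + minor + major
    not_working = TOTAL_QUERIES - running
    percent = round(100 * running / TOTAL_QUERIES)
    return running, not_working, percent


summary = {
    scale: {engine: summarize(*row) for engine, row in engines.items()}
    for scale, engines in counts.items()
}

for scale, engines in summary.items():
    for engine, (running, not_working, percent) in engines.items():
        print(f"{scale:>5}  {engine:<13} runs {running:>2} ({percent}%), not working {not_working}")
```

Note that under this reading the percentages for Impala and Hive include queries that needed a major, non-compliant re-write.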



IBM Big SQL – Runs 100% of the queries

Key points

In competitive environments, many queries needed to be re-written, some significantly

Owing to various restrictions, some queries could not be re-written or failed at run-time

Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real databases in production is another

Competitive environments require significant effort at scale

Results for 10TB scale shown here



Hadoop-DS – Performance results

                              Elapsed time (s)        Qph-HDS                 Big SQL Advantage
                   # Queries  Power      Throughput   Power      Throughput   Power      Throughput
46 common queries @ 10TB (Qph-HDS@10TB)
IBM Big SQL 3.0    46         2,908      6,945        5,694      9,537
Impala 1.4.1       46         10,536     14,920       1,571      4,439        3.6        2.1
Hive 0.13          46         15,949     59,550       1,038      1,112        5.4        8.5
All 99 queries @ 10TB (Qph-HDS@10TB)
IBM Big SQL 3.0    99         32,361     88,764       1,101      2,409
Impala 1.4.1       99         Not Possible
Hive 0.13          99         Not Possible
All 99 queries @ 30TB (Qph-HDS@30TB)
IBM Big SQL 3.0    99         104,445    187,993      1,023      2,274
Impala 1.4.1       99         Not Possible
Hive 0.13          99         Not Possible
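
The slides do not spell out how Qph-HDS is derived from elapsed time. The sketch below is an inference, not the benchmark's published definition: it assumes Qph-HDS = (scale factor in GB / 100) x total queries / elapsed hours, truncated to a whole number, and it assumes the "Big SQL Advantage" figures are elapsed-time ratios truncated to one decimal. Under those assumptions it reproduces the 46-query rows at 10TB (four streams in the throughput run) and the 30TB row. The function names are illustrative.

```python
def qph(queries_per_stream, streams, elapsed_seconds, scale_factor_gb):
    """Assumed metric: (SF / 100) * total queries / elapsed hours, truncated."""
    total_queries = queries_per_stream * streams
    return (scale_factor_gb * total_queries * 3600) // (100 * elapsed_seconds)


def advantage(other_seconds, big_sql_seconds):
    """Elapsed-time ratio truncated (not rounded) to one decimal place."""
    return (10 * other_seconds // big_sql_seconds) / 10


# 46 common queries @ 10TB (scale factor 10,000 GB): (power seconds, throughput seconds)
elapsed_10tb = {
    "IBM Big SQL 3.0": (2908, 6945),
    "Impala 1.4.1": (10536, 14920),
    "Hive 0.13": (15949, 59550),
}

qph_10tb = {
    engine: (qph(46, 1, power, 10_000), qph(46, 4, throughput, 10_000))
    for engine, (power, throughput) in elapsed_10tb.items()
}

big_sql_power, big_sql_throughput = elapsed_10tb["IBM Big SQL 3.0"]
advantage_10tb = {
    engine: (advantage(power, big_sql_power), advantage(throughput, big_sql_throughput))
    for engine, (power, throughput) in elapsed_10tb.items()
    if engine != "IBM Big SQL 3.0"
}

# All 99 queries @ 30TB (scale factor 30,000 GB), single stream and four streams
qph_30tb = (qph(99, 1, 104_445, 30_000), qph(99, 4, 187_993, 30_000))

print(qph_10tb)
print(advantage_10tb)
print(qph_30tb)
```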



IBM Big SQL – Leading performance

Up to 5.4x faster: 3.6x faster than Impala, 5.4x faster than Hive

[Chart: Power run (single-stream), elapsed time in seconds, as measured across the subset of 46 queries that Impala and Hive can both run. Big SQL 48:28, Impala 2:55:36, Hive 4:25:49]



IBM Big SQL – Leading performance

Up to 8.5x faster: 2.1x faster than Impala, 8.5x faster than Hive

[Chart: Throughput run, 4 streams, average elapsed time in seconds, as measured across the subset of 46 queries that Impala and Hive can both run. Big SQL 1:55:45, Impala 4:08:40, Hive 16:32:30]
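
The clock times on the two charts are the elapsed seconds from the performance results expressed as hours, minutes and seconds. A minimal conversion, with an illustrative helper name, shows the correspondence:

```python
def clock(seconds):
    """Format whole seconds as h:mm:ss, or m:ss when under one hour."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


# Elapsed seconds for the 46 common queries @ 10TB: (power run, throughput run)
elapsed = {
    "Big SQL": (2908, 6945),
    "Impala": (10536, 14920),
    "Hive": (15949, 59550),
}

clock_times = {
    engine: (clock(power), clock(throughput))
    for engine, (power, throughput) in elapsed.items()
}

for engine, (power, throughput) in clock_times.items():
    print(f"{engine:<8} power {power:>8}  throughput {throughput:>8}")
```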



30TB Hadoop-DS Results

Because other distributions could not run the 99 required queries, it was only possible to obtain a result for Big SQL

IBM had hoped to obtain partial results @ 30TB (comparing queries that would run across distributions)

Testing convinced us that the number of queries that competitors could run @ 30TB was sufficiently small that a detailed comparison would not be valid



Big SQL – Scalability and Throughput

Four concurrent query streams @30TB complete in 1.8x the time of a single stream

[Chart: Elapsed times (secs) for Big SQL Hadoop-DS @30TB, single stream and 4 streams. Power run: 99 queries. Throughput run: 396 queries]
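
The 1.8x figure can be checked against the 30TB elapsed times in the performance results (104,445 s for the power run, 187,993 s for the throughput run). The sketch below also derives the implied gain in queries completed per unit time; the variable names are illustrative.

```python
power_seconds = 104_445       # 99 queries, single stream @ 30TB
throughput_seconds = 187_993  # 396 queries, four streams @ 30TB

time_ratio = throughput_seconds / power_seconds  # how much longer the 4-stream run took
work_ratio = 396 / 99                            # how much more work it completed
throughput_gain = work_ratio / time_ratio        # queries per unit time, relative to one stream

print(round(time_ratio, 1), round(throughput_gain, 1))
```

Four times the work in about 1.8 times the elapsed time corresponds to roughly 2.2 times the query throughput of a single stream.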



Audited Results

Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale

InfoSizing, Transaction Processing Performance Council Certified Auditors, verified both IBM results as well as results on Cloudera Impala and Hortonworks Hive

These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented



Conclusions

Big SQL is the only SQL-on-Hadoop engine able to run a full Hadoop-DS workload

Complete schema

All 99 queries

Multi-user test

Ran at both 10TB and 30TB data volumes

Together, this makes the test a good predictor of compatibility with real applications

IBM Big SQL is the best performing solution by a large margin

~ 3.6 times faster than Cloudera Impala (single-stream power run, 46 common queries @ 10TB)

~ 5.4 times faster than Hortonworks Hive (same test)



Thank you!



Additional Slides



Query throughput for Hadoop-DS @ 10TB

[Chart: effective query throughput (Qph-HDS@10TB, axis 0 to 3,000), 4 concurrent streams and 99 queries, for Big SQL, Impala and Hive 0.13. Data labels: 87, 12. Annotation: 99 queries could not be run]



Query throughput for Hadoop-DS @ 30TB

[Chart: effective query throughput (Qph-HDS@30TB, axis 0 to 2,500), 6 concurrent streams and 99 queries, for Big SQL, Impala and Hive 0.13]





The Common Query Set

While Big SQL ran all queries, many of the Hadoop-DS queries would not run on Impala or Hive

On both platforms, some additional queries could be made to run by re-writing the queries (something that is not permitted in the TPC-DS benchmark specification)

At 10TB scale, several queries failed at run-time

The 46 queries in the common set are those that ran on all three engines at 10TB scale and could thus be compared

The testing team deliberately included some queries with non-compliant query modifications, where the changes were judged to be minor, in order to have a reasonable number of queries to compare

46 queries could be run on Big SQL, Impala and Hive at 10TB

Queries shown in blue are part of the common set
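
In set terms, the common query set is the intersection of the queries each engine completed at 10TB. The sketch below illustrates only the operation: the query numbers are made up for the example and are not the deck's actual lists, which are not reproduced in this text.

```python
# HYPOTHETICAL query numbers, for illustration only (not the benchmark's real lists)
ran_at_10tb = {
    "Big SQL 3.0": {1, 2, 3, 4, 5, 6},
    "Impala 1.4.1": {1, 3, 4, 6},
    "Hive 0.13": {1, 2, 3, 6},
}

# A query belongs to the common set only if every engine ran it
common_set = set.intersection(*ran_at_10tb.values())

print(sorted(common_set))
```

In the benchmark itself, this intersection contained 46 of the 99 queries.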



About the TPC-DS queries

The queries are diverse, and many are complex

Reflecting real business needs – a random sample:

Find customers returning items more frequently than normal (q1)

States with customers most amenable to premium priced offers (q6)

List key metrics for unadvertised in-store promotions by demographic (q7)

Identify similar customers purchasing through multiple sales outlets (q10)

Find customers shifting purchasing habits to the web (q11)

Key measures for catalog sales fulfilled from an alternate warehouse (q16)

Find frequently sold items and the circumstances under which repeat sales take place (q23)

Understand the products and retail locations where items are likely to be returned and subsequently re-purchased via the catalog (q29)

Display customers making significant local purchases compared to buying potential based on dependents and vehicles owned (q34)



Benchmark Environment

[Diagram: X3650BD master host and X3650BD data nodes (data node #1 onward) attached to a 10 GbE switch on a 10 GbE private net, plus the IBM Blue net and Mgmt net]

Three identical clusters deployed, one for each distribution
