Strata Stinger Talk, October 2013
DESCRIPTION
Slides from the talk "Apache Hive and Stinger: Petabyte Scale SQL in Hadoop," presented by Arun Murthy, Alan Gates, and Owen O'Malley.
TRANSCRIPT
© Hortonworks Inc. 2013.
Apache Hive and Stinger: SQL in Hadoop
Arun Murthy (@acmurthy) Alan Gates (@alanfgates) Owen O’Malley (@owen_omalley) @hortonworks
YARN: Taking Hadoop Beyond Batch
Applications Run Natively IN Hadoop
HDFS2 (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH (MapReduce)
INTERACTIVE (Tez)
STREAMING (Storm, S4,…)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
ONLINE (HBase)
OTHER (Search, Weave, …)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Hadoop Beyond Batch with YARN
HADOOP 1
HDFS (redundant, reliable storage)
MapReduce (cluster resource management & data processing)
HDFS2 (redundant, reliable storage)
YARN (operating system: cluster resource management)
MapReduce (batch)
Others (varied)
HADOOP 2
Single Use System Batch Apps
Multi Use Data Platform Batch, Interactive, Online, Streaming, …
Tez (interactive)
A shift from the old to the new…
Apache Tez ("Speed")
• Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc.
  – Lower latency for interactive queries
  – Higher throughput for batch queries
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
YARN ApplicationMaster to run DAG of Tez Tasks
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
Task
Processor Input Output
Tez: Building blocks for scalable data processing
• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ (for Map-Reduce-Reduce): Shuffle Input → Reduce Processor → Sorted Output
Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.x
UNION
SELECT x, AVG(y) AS avg
FROM c
GROUP BY x
ORDER BY avg;
[Diagram: the same plan shown as MapReduce jobs vs. a single Tez DAG. Stages shown include SELECT b.id, SELECT c.price, JOIN(a, b), JOIN(a, c), GROUP BY a.state, COUNT(*), and AVG(c.price). In the Hive-on-MR version each stage is a separate M/R job that writes its intermediate output to HDFS before the next job reads it back; in the Hive-on-Tez version the stages form one DAG that streams data between tasks.]
Tez avoids unneeded writes to HDFS.
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
  – Hot containers ready for immediate use
  – Removes task and job launch overhead (~5s – 30s)
• Hive
  – Session launch/shutdown in background (seamless; user not aware)
  – Submits query plan directly to the Tez session
Native Hadoop service, not ad-hoc
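As a concrete sketch of how a user ends up on Tez: in Hive builds that carry the Tez integration (trunk at the time of this talk, not the Hive 0.12 release), the engine is selected per session with a single setting, and subsequent queries reuse the warm session. The table name below is illustrative; `hive.execution.engine` is the property from Hive's Tez integration.

```sql
-- Run this Hive session on Tez instead of MapReduce
-- (from Hive's Tez branch; not in the Hive 0.12 release covered here).
SET hive.execution.engine=tez;

-- Later queries reuse the hot Tez session and its containers,
-- avoiding the ~5-30s per-query MapReduce job-launch overhead.
SELECT state, COUNT(*) FROM customers GROUP BY state;
```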
Tez Delivers Interactive Query - Out of the Box!
Feature | Description | Benefit
Tez Session | Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster. | Latency
Tez Container Pre-Launch | Overcomes MapReduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out-of-box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and the Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
Stinger Project (announced February 2013)

Batch AND Interactive SQL-IN-Hadoop. The Stinger Initiative is a broad, community-based effort to drive the next generation of Hive.

Goals:
• Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop

Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN

Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features
• 10x faster query launch when using a large number (500+) of partitions
• ORCFile predicate pushdown speeds queries
• Evaluate LIMIT on the map side
• Parallel ORDER BY
• New query optimizer
• Introduces VARCHAR and DATE datatypes
• GROUP BY on structs or unions
Included Components
Apache Hive 0.12
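The new datatypes can be exercised directly in DDL. A minimal sketch (table and column names are illustrative):

```sql
-- Hive 0.12: VARCHAR carries a maximum length; DATE stores only
-- year/month/day, with no time component (unlike TIMESTAMP).
CREATE TABLE orders (
  order_id   BIGINT,
  status     VARCHAR(16),   -- new in Hive 0.12
  order_date DATE           -- new in Hive 0.12
);

SELECT order_date, COUNT(*)
FROM orders
WHERE status = 'SHIPPED'
GROUP BY order_date;
```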
SPEED: Increasing Hive Performance
Performance Improvements included in Hive 0.12
– Base & advanced query optimization
– Startup time improvement
– Join optimizations

Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
Stinger Phase 3: Interactive Query In Hadoop
TPC-DS Query 27 (Pricing Analytics using Star Schema Join):
• Hive 0.10: 1400s
• Hive 0.11 (Phase 1): 39s
• Trunk (Phase 3): 7.2s (190x improvement)

TPC-DS Query 82 (Inventory Analytics Joining 2 Large Fact Tables):
• Hive 0.10: 3200s
• Hive 0.11 (Phase 1): 65s
• Trunk (Phase 3): 14.9s (200x improvement)

All results at Scale Factor 200 (approximately 200 GB data).
Speed: Delivering Interactive Query

TPC-DS Query 52 (Star Schema Join): Hive 0.12: 41.1s → Trunk (Phase 3): 4.2s
TPC-DS Query 55 (Star Schema Join): Hive 0.12: 39.8s → Trunk (Phase 3): 4.1s
(Query time in seconds)

Test Cluster:
• 200 GB data (Impala: Parquet; Hive: ORCFile)
• 20 nodes, 24 GB RAM each, 6x disk each
Speed: Delivering Interactive Query

TPC-DS Query 28 (Vectorization): Hive 0.12: 22s → Trunk (Phase 3): 9.8s
TPC-DS Query 12 (Complex join, M-R-R pattern): Hive 0.12: 31s → Trunk (Phase 3): 6.7s
(Query time in seconds)

Test Cluster:
• 200 GB data (Impala: Parquet; Hive: ORCFile)
• 20 nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 1 (Simple Filter Query), query time in seconds (lower is better):
• Query 1a: Hive 0.10 (5-node EC2): 45s → Trunk (Phase 3): 1.6s
• Query 1b: Hive 0.10 (5-node EC2): 63s → Trunk (Phase 3): 2.3s
• Query 1c: Hive 0.10 (5-node EC2): 63s → Trunk (Phase 3): 9.4s

Stinger Phase 3 Cluster Configuration:
• AMPLab Data Set (~135 GB data)
• 20 nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 2 (Group By IP Block and Aggregate), query time in seconds (lower is better):
• Query 2a: Hive 0.10 (5-node EC2): 466s → Trunk (Phase 3): 104.3s
• Query 2b: Hive 0.10 (5-node EC2): 490s → Trunk (Phase 3): 118.3s
• Query 2c: Hive 0.10 (5-node EC2): 552s → Trunk (Phase 3): 172.7s

Stinger Phase 3 Cluster Configuration:
• AMPLab Data Set (~135 GB data)
• 20 nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 3 (Correlate Page Rankings and Revenues Across Time), query time in seconds (lower is better):
• Query 3a: Hive 0.10 (5-node EC2): 466s → Trunk (Phase 3): 40s
• Query 3b: Hive 0.10 (5-node EC2): 490s → Trunk (Phase 3): 145s

Stinger Phase 3 Cluster Configuration:
• AMPLab Data Set (~135 GB data)
• 20 nodes, 24 GB RAM each, 6x disk each
How Stinger Phase 3 Delivers Interactive Query
Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce. | Latency
Vectorized Query | Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Uses extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of the input (beyond partition pruning). | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency
SQL: Enhancing SQL Semantics
Hive SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY/MAP/STRUCT/UNION, DATE, VARCHAR, CHAR

Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries in WHERE, HAVING; expanded JOIN syntax; SQL-compliant security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID)

(Legend on the original slide: each item is color-coded as new in Hive 0.12, already Available, or on the Roadmap.)

SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
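Two of the semantics listed above, windowing functions (available since Hive 0.11) and ROLLUP, look like this in practice (the `sales` table and its columns are illustrative):

```sql
-- Windowing: rank items by price within each state (OVER/RANK).
SELECT state, item_id,
       RANK() OVER (PARTITION BY state ORDER BY price DESC) AS price_rank
FROM sales;

-- ROLLUP: per-state subtotals plus a grand total in one pass.
SELECT state, item_id, SUM(price) AS total
FROM sales
GROUP BY state, item_id WITH ROLLUP;
```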
ORC File Format
• Columnar format for complex data types
• Built into Hive from 0.11
• Support for Pig and MapReduce via HCat
• Two levels of compression
  – Lightweight type-specific and generic
• Built-in indexes
  – Every 10,000 rows with position information
  – Min, Max, Sum, Count of each column
  – Supports seek to row number
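Creating an ORC-backed table is a one-line change to the DDL, and the generic compression codec can be chosen per table via the standard `orc.compress` table property (table and column names below are illustrative):

```sql
-- Store the table in ORC; the built-in indexes every 10,000 rows
-- are written automatically, and ZLIB is the generic codec layered
-- on top of the lightweight type-specific encoding.
CREATE TABLE sales_orc (
  state   STRING,
  item_id BIGINT,
  price   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Populate from an existing text-format table.
INSERT OVERWRITE TABLE sales_orc
SELECT state, item_id, price FROM sales_text;
```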
SCALE: Interactive Query at Petabyte Scale
Sustained Query Times: Apache Hive 0.12 provides sustained, acceptable query times even at petabyte scale.
File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500):
• Encoded with Text: 585 GB (original size)
• Encoded with RCFile: 505 GB (14% smaller)
• Encoded with Parquet (used by Impala): 221 GB (62% smaller)
• Encoded with ORCFile (Hive 0.12): 131 GB (78% smaller)

• Larger block sizes
• Columnar format arranges columns adjacent within the file for compression & fast access

Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.
ORC File Format
• Hive 0.12
  – Predicate push down
  – Improved run-length encoding
  – Adaptive string dictionaries
  – Padding stripes to HDFS block boundaries
• Trunk
  – Stripe-based input splits
  – Input split elimination
  – Vectorized reader
  – Customized Pig load and store functions
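Predicate push down uses the min/max statistics in ORC's built-in indexes to skip row groups that cannot match the WHERE clause. A sketch of turning it on in Hive 0.12 (`hive.optimize.index.filter` is the Hive property gating ORC predicate pushdown; it defaults to off, and the table below is illustrative):

```sql
-- Let the ORC reader evaluate the WHERE predicate against per-row-group
-- min/max statistics and skip non-matching 10,000-row groups.
SET hive.optimize.index.filter=true;

-- Only row groups whose price range overlaps (100, +inf) are read.
SELECT state, item_id FROM sales_orc WHERE price > 100;
```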
Vectorized Query Execution
• Designed for modern processor architectures
  – Avoid branching in the inner loop
  – Make the most use of L1 and L2 cache
• How it works
  – Process records in batches of 1,000 rows
  – Generate code from templates to minimize branching
• What it gives
  – 30x improvement in rows processed per second
  – Initial prototype: 100M rows/sec on a laptop
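On builds that carry the vectorization work (it landed in Hive after 0.12, so this is a trunk-era sketch), batched execution is toggled by a single setting; `hive.vectorized.execution.enabled` is the property from that branch, and the table is illustrative:

```sql
-- Process 1,000-row batches through the operator pipeline instead of
-- one row at a time (the initial implementation requires ORC input).
SET hive.vectorized.execution.enabled=true;

SELECT state, AVG(price) FROM sales_orc GROUP BY state;
```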
HDFS Buffer Cache
• Use memory-mapped buffers for zero copy
  – Avoid overhead of going through the DataNode
  – Can mlock the block files into RAM
• ORC reader enhanced for zero-copy reads
  – New compression interfaces in Hadoop
• Vectorization-specific reader
  – Reads 1,000 rows at a time
  – Reads into Hive's internal representation
Next Steps
• Blog: http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/
• Stinger Initiative: http://hortonworks.com/labs/stinger/
• Stinger Beta: HDP 2.1 Beta, December 2013
© Hortonworks Inc. 2013. Confidential and Proprietary.
Thank You!
@acmurthy @alanfgates @owen_omalley @hortonworks