Strata Stinger Talk, October 2013
DESCRIPTION
Slides from the talk "Apache Hive and Stinger: Petabyte Scale SQL in Hadoop," presented by Arun Murthy, Alan Gates, and Owen O'Malley.
TRANSCRIPT
© Hortonworks Inc. 2013.
Apache Hive and Stinger: SQL in Hadoop
Arun Murthy (@acmurthy) Alan Gates (@alanfgates) Owen O’Malley (@owen_omalley) @hortonworks
YARN: Taking Hadoop Beyond Batch
Applications Run Natively IN Hadoop
HDFS2 (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH (MapReduce)
INTERACTIVE (Tez)
STREAMING (Storm, S4,…)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
ONLINE (HBase)
OTHER (Search, Weave, …)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Hadoop Beyond Batch with YARN
HADOOP 1
HDFS (redundant, reliable storage)
MapReduce (cluster resource management & data processing)
HDFS2 (redundant, reliable storage)
YARN (operating system: cluster resource management)
MapReduce (batch)
Others (varied)
HADOOP 2
Single Use System Batch Apps
Multi Use Data Platform Batch, Interactive, Online, Streaming, …
Tez (interactive)
A shift from the old to the new…
Apache Tez ("Speed")
• Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc.
  – Lower latency for interactive queries
  – Higher throughput for batch queries
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
YARN ApplicationMaster to run DAG of Tez Tasks
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
Task
Processor Input Output
Tez: Building blocks for scalable data processing
• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ (for Map-Reduce-Reduce): Shuffle Input → Reduce Processor → Sorted Output
Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.x
UNION
SELECT x, AVG(y) AS avg
FROM c
GROUP BY x
ORDER BY avg;
[Diagram: the same plan shown as MapReduce jobs vs. a single Tez DAG. Stages shown include SELECT b.id, SELECT c.price, JOIN(a, b), JOIN(a, c), GROUP BY a.state, COUNT(*), and AVG(c.price). In the Hive-on-MR version each stage is a separate M/R job that writes its intermediate output to HDFS before the next job reads it back; in the Hive-on-Tez version the stages form one DAG that streams data between tasks.]
Tez avoids unneeded writes to HDFS.
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
  – Hot containers ready for immediate use
  – Removes task and job launch overhead (~5s – 30s)
• Hive
  – Session launch/shutdown in background (seamless; user not aware)
  – Submits query plan directly to the Tez session
Native Hadoop service, not ad-hoc
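As a concrete sketch of how a user ends up on Tez: in Hive builds that carry the Tez integration (trunk at the time of this talk, not the Hive 0.12 release), the engine is selected per session with a single setting, and subsequent queries reuse the warm session. The table name below is illustrative; `hive.execution.engine` is the property from Hive's Tez integration.

```sql
-- Run this Hive session on Tez instead of MapReduce
-- (from Hive's Tez branch; not in the Hive 0.12 release covered here).
SET hive.execution.engine=tez;

-- Later queries reuse the hot Tez session and its containers,
-- avoiding the ~5-30s per-query MapReduce job-launch overhead.
SELECT state, COUNT(*) FROM customers GROUP BY state;
```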
Tez Delivers Interactive Query - Out of the Box!
Feature | Description | Benefit
Tez Session | Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster. | Latency
Tez Container Pre-Launch | Overcomes MapReduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out-of-box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and the Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
Stinger Project (announced February 2013)

Batch AND Interactive SQL-IN-Hadoop. The Stinger Initiative is a broad, community-based effort to drive the next generation of Hive.

Goals:
• Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop

Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN

Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features
• 10x faster query launch when using a large number (500+) of partitions
• ORCFile predicate pushdown speeds queries
• Evaluate LIMIT on the map side
• Parallel ORDER BY
• New query optimizer
• Introduces VARCHAR and DATE datatypes
• GROUP BY on structs or unions
Included Components
Apache Hive 0.12
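The new datatypes can be exercised directly in DDL. A minimal sketch (table and column names are illustrative):

```sql
-- Hive 0.12: VARCHAR carries a maximum length; DATE stores only
-- year/month/day, with no time component (unlike TIMESTAMP).
CREATE TABLE orders (
  order_id   BIGINT,
  status     VARCHAR(16),   -- new in Hive 0.12
  order_date DATE           -- new in Hive 0.12
);

SELECT order_date, COUNT(*)
FROM orders
WHERE status = 'SHIPPED'
GROUP BY order_date;
```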
SPEED: Increasing Hive Performance
Performance Improvements included in Hive 0.12
– Base & advanced query optimization
– Startup time improvement
– Join optimizations

Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
Stinger Phase 3: Interactive Query In Hadoop
TPC-DS Query 27 (Pricing Analytics using Star Schema Join):
• Hive 0.10: 1400s
• Hive 0.11 (Phase 1): 39s
• Trunk (Phase 3): 7.2s (190x improvement)

TPC-DS Query 82 (Inventory Analytics Joining 2 Large Fact Tables):
• Hive 0.10: 3200s
• Hive 0.11 (Phase 1): 65s
• Trunk (Phase 3): 14.9s (200x improvement)

All results at Scale Factor 200 (approximately 200 GB data).
Speed: Delivering Interactive Query

TPC-DS Query 52 (Star Schema Join): Hive 0.12: 41.1s → Trunk (Phase 3): 4.2s
TPC-DS Query 55 (Star Schema Join): Hive 0.12: 39.8s → Trunk (Phase 3): 4.1s
(Query time in seconds)

Test Cluster:
• 200 GB data (Impala: Parquet; Hive: ORCFile)
• 20 nodes, 24 GB RAM each, 6x disk each
Speed: Delivering Interactive Query

TPC-DS Query 28 (Vectorization): Hive 0.12: 22s → Trunk (Phase 3): 9.8s
TPC-DS Query 12 (Complex join, M-R-R pattern): Hive 0.12: 31s → Trunk (Phase 3): 6.7s
(Query time in seconds)

Test Cluster:
• 200 GB data (Impala: Parquet; Hive: ORCFile)
• 20 nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 1 (Simple Filter Query), query time in seconds (lower is better):
• Query 1a: Hive 0.10 (5-node EC2): 45s → Trunk (Phase 3): 1.6s
• Query 1b: Hive 0.10 (5-node EC2): 63s → Trunk (Phase 3): 2.3s
• Query 1c: Hive 0.10 (5-node EC2): 63s → Trunk (Phase 3): 9.4s

Stinger Phase 3 Cluster Configuration:
• AMPLab Data Set (~135 GB data)
• 20 nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 2 (Group By IP Block and Aggregate), query time in seconds (lower is better):
• Query 2a: Hive 0.10 (5-node EC2): 466s → Trunk (Phase 3): 104.3s
• Query 2b: Hive 0.10 (5-node EC2): 490s → Trunk (Phase 3): 118.3s
• Query 2c: Hive 0.10 (5-node EC2): 552s → Trunk (Phase 3): 172.7s

Stinger Phase 3 Cluster Configuration:
• AMPLab Data Set (~135 GB data)
• 20 nodes, 24 GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 3 (Correlate Page Rankings and Revenues Across Time), query time in seconds (lower is better):
• Query 3a: Hive 0.10 (5-node EC2): 466s → Trunk (Phase 3): 40s
• Query 3b: Hive 0.10 (5-node EC2): 490s → Trunk (Phase 3): 145s

Stinger Phase 3 Cluster Configuration:
• AMPLab Data Set (~135 GB data)
• 20 nodes, 24 GB RAM each, 6x disk each
How Stinger Phase 3 Delivers Interactive Query
Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce. | Latency
Vectorized Query | Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Uses extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of the input (beyond partition pruning). | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency
SQL: Enhancing SQL Semantics
Hive SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY/MAP/STRUCT/UNION, DATE, VARCHAR, CHAR

Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries in WHERE, HAVING; expanded JOIN syntax; SQL-compliant security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID)

(Legend on the original slide: each item is color-coded as new in Hive 0.12, already Available, or on the Roadmap.)

SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
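Two of the semantics listed above, windowing functions (available since Hive 0.11) and ROLLUP, look like this in practice (the `sales` table and its columns are illustrative):

```sql
-- Windowing: rank items by price within each state (OVER/RANK).
SELECT state, item_id,
       RANK() OVER (PARTITION BY state ORDER BY price DESC) AS price_rank
FROM sales;

-- ROLLUP: per-state subtotals plus a grand total in one pass.
SELECT state, item_id, SUM(price) AS total
FROM sales
GROUP BY state, item_id WITH ROLLUP;
```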
ORC File Format
• Columnar format for complex data types
• Built into Hive from 0.11
• Support for Pig and MapReduce via HCat
• Two levels of compression
  – Lightweight type-specific and generic
• Built-in indexes
  – Every 10,000 rows with position information
  – Min, Max, Sum, Count of each column
  – Supports seek to row number
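Creating an ORC-backed table is a one-line change to the DDL, and the generic compression codec can be chosen per table via the standard `orc.compress` table property (table and column names below are illustrative):

```sql
-- Store the table in ORC; the built-in indexes every 10,000 rows
-- are written automatically, and ZLIB is the generic codec layered
-- on top of the lightweight type-specific encoding.
CREATE TABLE sales_orc (
  state   STRING,
  item_id BIGINT,
  price   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Populate from an existing text-format table.
INSERT OVERWRITE TABLE sales_orc
SELECT state, item_id, price FROM sales_text;
```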
SCALE: Interactive Query at Petabyte Scale
Sustained Query Times: Apache Hive 0.12 provides sustained, acceptable query times even at petabyte scale.
File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500):
• Encoded with Text: 585 GB (original size)
• Encoded with RCFile: 505 GB (14% smaller)
• Encoded with Parquet (used by Impala): 221 GB (62% smaller)
• Encoded with ORCFile (Hive 0.12): 131 GB (78% smaller)

• Larger block sizes
• Columnar format arranges columns adjacent within the file for compression & fast access

Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.
ORC File Format
• Hive 0.12
  – Predicate push down
  – Improved run-length encoding
  – Adaptive string dictionaries
  – Padding stripes to HDFS block boundaries
• Trunk
  – Stripe-based input splits
  – Input split elimination
  – Vectorized reader
  – Customized Pig load and store functions
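Predicate push down uses the min/max statistics in ORC's built-in indexes to skip row groups that cannot match the WHERE clause. A sketch of turning it on in Hive 0.12 (`hive.optimize.index.filter` is the Hive property gating ORC predicate pushdown; it defaults to off, and the table below is illustrative):

```sql
-- Let the ORC reader evaluate the WHERE predicate against per-row-group
-- min/max statistics and skip non-matching 10,000-row groups.
SET hive.optimize.index.filter=true;

-- Only row groups whose price range overlaps (100, +inf) are read.
SELECT state, item_id FROM sales_orc WHERE price > 100;
```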
Vectorized Query Execution
• Designed for modern processor architectures
  – Avoid branching in the inner loop
  – Make the most use of L1 and L2 cache
• How it works
  – Process records in batches of 1,000 rows
  – Generate code from templates to minimize branching
• What it gives
  – 30x improvement in rows processed per second
  – Initial prototype: 100M rows/sec on a laptop
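On builds that carry the vectorization work (it landed in Hive after 0.12, so this is a trunk-era sketch), batched execution is toggled by a single setting; `hive.vectorized.execution.enabled` is the property from that branch, and the table is illustrative:

```sql
-- Process 1,000-row batches through the operator pipeline instead of
-- one row at a time (the initial implementation requires ORC input).
SET hive.vectorized.execution.enabled=true;

SELECT state, AVG(price) FROM sales_orc GROUP BY state;
```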
HDFS Buffer Cache
• Use memory-mapped buffers for zero copy
  – Avoid overhead of going through the DataNode
  – Can mlock the block files into RAM
• ORC reader enhanced for zero-copy reads
  – New compression interfaces in Hadoop
• Vectorization-specific reader
  – Reads 1,000 rows at a time
  – Reads into Hive's internal representation
Next Steps
• Blog: http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/
• Stinger Initiative: http://hortonworks.com/labs/stinger/
• Stinger Beta: HDP 2.1 Beta, December 2013
© Hortonworks Inc. 2013. Confidential and Proprietary.
Thank You!
@acmurthy @alanfgates @owen_omalley @hortonworks