An In-Depth Look at Putting the Sting in Hive
DESCRIPTION
Apache Hive is the most widely used SQL interface for Hadoop. As Hadoop usage continues its explosive growth, Hive's performance and features do not meet the requirements and expectations of many users. These include answering queries in human time (less than 30 seconds) and support for common analytics operations. The Hive community has risen to the challenge. Work is being done to drive down the start-up time of a Hive query, extend Hive to work on Tez (a Hadoop execution environment that is much faster than MapReduce), make Hive operators process records at 10x their current speed, add support for analytics and windowing functions such as RANK, NTILE, LEAD, and LAG, and add support to Hive for standard SQL datatypes. This talk will discuss the design and code changes that have been made, as well as ongoing work and additional optimizations and features that could be added in the future.
TRANSCRIPT
Putting the Sting in Hive
Alan F. Gates
@alanfgates
Stinger Overview
• An initiative, not a project or product
• Includes changes to Hive and a new project, Tez
• Two main goals
– Improve Hive performance 100x over Hive 0.10
– Extend Hive SQL to include features needed for analytics
• Hive will support:
– BI tools connecting to Hadoop
– Analysts performing ad-hoc, interactive queries
– Still excellent at the large batch jobs it is used for today
© 2013 Hortonworks
Stinger Mileposts
• Goals
– 1. Improve existing tools & preserve investments
– 2. Enable Hive to support interactive workloads
• Stinger Phase 1 (released in Hive 0.11)
– Base Optimizations
– SQL Analytics
– ORCFile Format
• Stinger Phase 2 (current work)
– YARN Resource Mgmt
– Hive on Apache Tez
– Query Service
– Vectorized Operators
• Stinger Phase 3 (roadmap)
– Buffer Cache
– Cost Based Optimizer
Hive Performance Gains in 0.11
• Enable star joins by improving Hive’s map join (aka broadcast join)
– Where possible, do the join in a single map-only task
– When not possible, push larger tables to separate tasks
• Collapse adjacent jobs where possible
– Hive has lots of M->MR type plans; collapse these to MR
– Collapse adjacent jobs on sufficiently similar keys when feasible
  – join followed by group
  – join followed by order
  – group followed by order
• Improvements in sort merge bucket (SMB) joins
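The map (broadcast) join idea above can be illustrated outside Hive. The following is a minimal Python sketch, not Hive's implementation: it assumes the dimension tables are small enough to hold in memory, and the table and column names (`sales`, `items`, `stores`, `item_id`, `store_id`) are hypothetical. Each small table is loaded into a hash table once, and the large fact table is streamed past it, so no shuffle or reduce step is needed.

```python
from collections import defaultdict


def broadcast_hash_join(fact_rows, dim_rows, fact_key, dim_key):
    """Inner-join a large (streamed) table with a small (in-memory) table."""
    # Build phase: load the small table into a hash table keyed on the join column.
    lookup = defaultdict(list)
    for dim in dim_rows:
        lookup[dim[dim_key]].append(dim)
    # Probe phase: stream the large table and look up matches, one row at a time.
    for fact in fact_rows:
        for dim in lookup.get(fact[fact_key], ()):
            yield {**dim, **fact}


def star_join(fact_rows, dims):
    """Chain several broadcast joins in one pass over the fact table.

    `dims` is a list of (dim_rows, fact_key, dim_key) tuples.
    """
    rows = fact_rows
    for dim_rows, fact_key, dim_key in dims:
        rows = broadcast_hash_join(rows, dim_rows, fact_key, dim_key)
    return rows


# Hypothetical sample data: one fact table and two small dimension tables.
sales = [
    {"item_id": 1, "store_id": 10, "price": 5.0},
    {"item_id": 2, "store_id": 11, "price": 7.5},
    {"item_id": 3, "store_id": 10, "price": 1.0},  # no matching item, dropped
]
items = [{"item_id": 1, "item_name": "widget"}, {"item_id": 2, "item_name": "gadget"}]
stores = [{"store_id": 10, "city": "Austin"}, {"store_id": 11, "city": "Boston"}]

joined = list(star_join(sales, [(items, "item_id", "item_id"),
                                (stores, "store_id", "store_id")]))
```

Because the joins are chained generators, the fact rows flow through all lookups in a single pass, which is the analogue of doing the whole star join in one map-only task.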
[Figures: "Before" and "After" slides; images not captured in this transcript]
Improvements in SMB Joins
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)
[Bar chart: Query 82, values by storage format; unit not shown in the transcript]
– Text: 3257.692
– RCFile: 2862.669
– Partitioned RCFile: 255.641
– Partitioned RCFile + Optimizations: 71.114
New Technologies in Hive
• All covered in depth in other talks
– See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today
• Tez – A new execution engine for relational tools such as Hive
– No need to use MapReduce, instead provides general DAG execution
– Data moved between tasks via socket, disk, or HDFS, based on the performance / re-startability trade-off
– Provides standing service to greatly reduce query start time
• ORCFile – A rewrite of RCFile
– Columnar
– Tightly integrated with Hive’s type model, including support for nested types
– Much better compression
– Supports projection and filter push down
• Vectorization – Rewriting operators to take advantage of modern processors
– Based on work done in MonetDB
– Rewrite operators to radically reduce number of function calls, branch prediction misses, and cache misses
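To make the vectorization bullet concrete, here is a small Python sketch contrasting row-at-a-time and batch-at-a-time evaluation of a filter plus sum. It is only an illustration of the control flow: the batch size of 1024 and the `price_cents` column are assumptions, and Python itself will not show the CPU-level gains (fewer function calls, branch mispredictions, and cache misses) that the slide refers to.

```python
from array import array

BATCH_SIZE = 1024  # assumed batch size, chosen for illustration only


def filter_sum_row_at_a_time(rows, threshold):
    """Classic operator loop: one lookup and one branch per row."""
    total = 0
    for row in rows:
        if row["price_cents"] > threshold:
            total += row["price_cents"]
    return total


def filter_sum_vectorized(price_column, threshold, batch_size=BATCH_SIZE):
    """Batch-oriented loop over a single column stored contiguously."""
    total = 0
    for start in range(0, len(price_column), batch_size):
        batch = price_column[start:start + batch_size]
        # Filter step: produce a selection vector of qualifying positions.
        selected = [i for i, price in enumerate(batch) if price > threshold]
        # Aggregate step: a tight loop over the selected positions only.
        total += sum(batch[i] for i in selected)
    return total


# Hypothetical data: the same prices as row dictionaries and as one column.
prices = array("q", range(1, 5001))
rows = [{"price_cents": p} for p in prices]

row_total = filter_sum_row_at_a_time(rows, 2500)
vec_total = filter_sum_vectorized(prices, 2500)
```

Both functions return the same total; the difference is that the second one works on a primitive column in fixed-size batches, which is the shape of operator that a compiled engine can run in tight loops.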
Standard Queries
[Bar chart: Query 27 (star join) and Query 82 (fact table join) at scale 200 and scale 1000]
– Series: Hive 0.10, RC File; Hive 0.11 CP, RC File; Hive 0.11 CP, ORC File
– Bar labels as extracted (pairing with bars not recoverable): 260, 165, 38, 77, 142, 296, 38, 42, 67, 80
Performance Trajectory
[Bar chart: Query 27 Speedup]
– Configurations: Hive 10, Text; Hive 10, RC; Hive 11, RC; Hive 11, ORC; Hive 11 CP, ORC, Tez
– Bar labels as extracted: 1X, 2X, 12X, 11X, 21X
[Bar chart: Query 82 Speedup]
– Hive 10, Text: 1X
– Hive 10, RC: 14X
– Hive 11, RC: 44X
– Hive 11, ORC: 57X
– Hive 11 CP, ORC, Tez: 78X
Query 12 – Demonstrating MRR
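[Bar chart: Query 12 - MRR Optimization, elapsed time in seconds]
– RC File, Scale 200: Traditional Map-Reduce 55, Tez Map-Reduce-Reduce 35
– ORC File, Scale 200: Traditional Map-Reduce 54, Tez Map-Reduce-Reduce 34
– RC File, Scale 1000: Traditional Map-Reduce 75, Tez Map-Reduce-Reduce 55
– ORC File, Scale 1000: Traditional Map-Reduce 65, Tez Map-Reduce-Reduce 46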
Hive Performance Up Next
• Push down start-up time: even for queries that spend less than a second running on the cluster, there is ~15 seconds of start-up time
– Tez service will remove Hadoop start-up issues
– Need to reduce time for the metadata access
– Need intelligent file caching so that hot tables can be kept in memory
• Keep working on the optimizer
– YSmart work from Ohio State University
– Start using statistics to make intelligent decisions about how many mappers and reducers to spawn (maybe in Hive, maybe in Tez)
– Start using statistics to choose between competing plan options
• Buffer Cache
– Coordinate with HDFS team to determine caching strategy
Extending Hive SQL in 0.11
• DECIMAL data type – for fixed precision calculation (e.g. currency)
• OVER clause
– PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING
– Works with existing aggregate functions
– New analytic and window functions added
  – ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK
SELECT salesperson, AVG(salesprice) OVER
  (PARTITION BY region ORDER BY date
   ROWS BETWEEN 10 PRECEDING AND 10 FOLLOWING)
FROM sales;
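What the windowed AVG in the query above computes can be spelled out in a few lines of Python. This is a sketch of the semantics only (partition, sort within the partition, then average over a sliding frame of rows), not of how Hive evaluates it; the sample rows are made up, and the frame is narrowed to 1 PRECEDING / 1 FOLLOWING so the result is easy to check by hand.

```python
from collections import defaultdict
from operator import itemgetter


def windowed_avg(rows, partition_key, order_key, value_key, preceding, following):
    """AVG(value) OVER (PARTITION BY ... ORDER BY ...
    ROWS BETWEEN preceding PRECEDING AND following FOLLOWING)."""
    # PARTITION BY: group rows by the partition column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for part in partitions.values():
        # ORDER BY: sort rows inside each partition.
        part.sort(key=itemgetter(order_key))
        for i, row in enumerate(part):
            # ROWS BETWEEN: the frame is clipped at the partition boundaries.
            lo = max(0, i - preceding)
            hi = min(len(part), i + following + 1)
            frame = [r[value_key] for r in part[lo:hi]]
            result.append({**row, "avg_salesprice": sum(frame) / len(frame)})
    return result


# Hypothetical rows standing in for the `sales` table in the query.
sales = [
    {"salesperson": "ann", "region": "east", "date": "2013-06-02", "salesprice": 20},
    {"salesperson": "bob", "region": "east", "date": "2013-06-01", "salesprice": 10},
    {"salesperson": "cat", "region": "east", "date": "2013-06-03", "salesprice": 30},
    {"salesperson": "dan", "region": "west", "date": "2013-06-01", "salesprice": 100},
]

averaged = windowed_avg(sales, "region", "date", "salesprice", preceding=1, following=1)
by_person = {row["salesperson"]: row["avg_salesprice"] for row in averaged}
```

Note that the frame never crosses a partition boundary: the single `west` row averages only itself, regardless of the frame width.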
Extending Hive SQL Post 0.11
• Subqueries in WHERE
– Non-correlated first
– [NOT] IN first, then extend to (in)equalities and EXISTS
• Datatype conformance – Hive has a Java type model; add support for SQL types:
– DATE
– CHAR() and VARCHAR()
– add precision and scale to decimal and float
– aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int, real/number = decimal)
• Security
– Add security checks to views, indices, functions, etc.
– Secure GRANT and REVOKE
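The motivation for DECIMAL with explicit precision and scale (fixed precision arithmetic, e.g. for currency) can be shown with Python's standard `decimal` module. This is a sketch of the general idea, not of Hive's behavior: the helper `to_decimal` is hypothetical, and both the half-up rounding and the choice to raise an error on overflow are assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal(value, precision, scale):
    """Coerce `value` to a DECIMAL(precision, scale)-like fixed-point number.

    Rounding mode and overflow handling are assumptions for this sketch.
    """
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal('0.01')
    d = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    # Digits left of the decimal point must fit in (precision - scale).
    integer_digits = max(d.adjusted() + 1, 0)
    if integer_digits > precision - scale:
        raise ValueError(f"{value} does not fit DECIMAL({precision},{scale})")
    return d


# Binary floating point cannot represent 0.1 or 0.2 exactly, so sums drift.
float_mismatch = (0.1 + 0.2) != 0.3

# Fixed-point decimals keep currency arithmetic exact.
exact_total = to_decimal("0.10", 10, 2) + to_decimal("0.20", 10, 2)
```

With scale 2, every value carries exactly two fractional digits, so adding prices never accumulates representation error the way binary floats do.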
Questions