hive on steroid
![Page 1: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/1.jpg)
© Hortonworks Inc. 2011© Hortonworks Inc. 2013
Hive on steroid
Project Stinger
![Page 2: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/2.jpg)
Who Am I?
• Olivier Renault
• Hortonworks solution engineer for EMEA
  – Joined Hortonworks EMEA in Jan 2013
• Eucalyptus – Open source Cloud solution
• Red Hat – Solution engineer
![Page 3: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/3.jpg)
What’s Hive?
• Uses HiveQL
• Hive translates SQL queries into MapReduce jobs
• De facto SQL interface in Hadoop
• Entry point for most BI tools
  – ODBC
• HCatalog merged with Hive
  – Metadata server
• Hive is able to query petabytes (PB) of data
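To make the SQL-to-MapReduce translation more concrete, here is a minimal Python sketch of how a query such as `SELECT state, COUNT(*) FROM sales GROUP BY state` could be evaluated as a map phase, a shuffle and a reduce phase. The table, its columns and the function names are hypothetical illustrations, not Hive internals.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table; names and values are illustrative only.
ROWS = [
    {"state": "CA", "amount": 10},
    {"state": "NY", "amount": 5},
    {"state": "CA", "amount": 7},
]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["state"], 1

def shuffle(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Apply the aggregate (here COUNT(*)) to each group."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_state(rows):
    return reduce_phase(shuffle(map_phase(rows)))

print(count_by_state(ROWS))
```

A more complex query (joins, sorting, several aggregations) generally needs several such map/reduce rounds, which is where the latency discussed later in the deck comes from.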
![Page 4: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/4.jpg)
Hive: Strength Through Community
Loyal Open Source Community and Real Corporate Interest/Contributions
• Open Source: Facebook, Teradata, SAP, Intel, Microsoft, Huawei, Yahoo, …
• Vendors: Dozens of vendors integrate with Hive, including Teradata, Microsoft, MicroStrategy, Tableau, Karmasphere, Datameer, Information Builders, SAP, Oracle, Actuate, QlikView, SAS, arcplan, Pentaho, Jaspersoft, Tibco, Talend, Informatica, …
• End Users: Countless enterprises use Hive as the de facto SQL interface to Hadoop data
![Page 5: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/5.jpg)
Problem: Hive was slow …
• Hive was able to interact with visualization tools, but you needed to be patient …
• In February 2013, Hortonworks launched the Stinger Initiative. The aim is to improve Hive performance by 100x
• Bringing Hive into the interactive query world
![Page 6: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/6.jpg)
Stinger Initiative
• Community initiative around Hive
• Enables Hive to support interactive workloads
• Enhances Hive’s standard SQL interface for Hadoop
• Improves existing tools & preserves investments
Query Planner (Hive) + Execution Engine (Tez) + File Format (ORC file) = 100X
![Page 7: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/7.jpg)
Stinger Project (announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: A broad, community-based effort to drive the next generation of Hive
Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Phase One (delivered: Hive 0.11, HDP 1.3)
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Phase Two (delivered: Hive 0.12, HDP 2.0)
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Phase Three (coming soon)
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
![Page 8: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/8.jpg)
Hive: Base optimization
New DAGs, analytics tools, …
![Page 9: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/9.jpg)
Hive Advanced Analytics
• Add OVER clause to support windowing queries
  – With standard arguments
  – Ranking functions: rank, ntile, row_number, dense_rank
  – With analytics functions: cume_dist, first_value, lag, last_value, lead, percentile_cont, percentile_disc, percent_rank
• Add CUBE and ROLLUP
  – Easily create summaries of your data
• Extend aggregation functions
  – STDDEV, VAR
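The three ranking functions listed above differ only in how they treat ties. The following Python sketch mimics what `row_number`, `rank` and `dense_rank` would return over a single partition ordered descending; the function name and the sample values are illustrative assumptions, not Hive code.

```python
def ranking(values):
    """Return (value, row_number, rank, dense_rank) tuples, ordered descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = dense = 0
    previous = None
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank jumps to the row position after a tie
            dense += 1         # dense_rank increases by one, never skipping
            previous = value
        result.append((value, row_number, rank, dense))
    return result

for row in ranking([300, 200, 300, 100]):
    print(row)
```

With two rows tied at 300, both get rank 1 and dense_rank 1; the next row gets rank 3 but dense_rank 2, while row_number simply counts 1 to 4.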
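CUBE and ROLLUP are shorthand for running the same aggregation over several grouping sets at once. As a rough sketch, the grouping sets they expand to can be enumerated in Python like this; the helper names and the `year`/`state` columns are hypothetical.

```python
from itertools import combinations

def rollup_sets(columns):
    """ROLLUP(a, b) groups by (a, b), then (a), then () for the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_sets(columns):
    """CUBE(a, b) groups by every subset of the columns."""
    sets = []
    for n in range(len(columns), -1, -1):
        sets.extend(combinations(columns, n))
    return sets

print(rollup_sets(["year", "state"]))
print(cube_sets(["year", "state"]))
```

ROLLUP over n columns yields n + 1 grouping sets (a hierarchy of subtotals), whereas CUBE yields all 2^n combinations.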
![Page 10: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/10.jpg)
Hive Data Type Conformance
• Extend Hive to support additional types from SQL
  – Improves applications and interoperability between tools
• Specific additions
  – Add fixed point NUMERIC and DECIMAL type (in progress)
  – Add VARCHAR and CHAR types with limited field size
  – Add DATETIME
  – Add size ranges from 1 to 53 for FLOAT
  – Add synonyms for compatibility
    – BLOB for BINARY
    – TEXT for STRING
    – REAL for FLOAT
![Page 11: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/11.jpg)
SQL: Enhancing SQL Semantics
| Hive SQL Datatypes | Hive SQL Semantics |
|---|---|
| INT | SELECT, INSERT |
| TINYINT/SMALLINT/BIGINT | GROUP BY, ORDER BY, SORT BY |
| BOOLEAN | JOIN on explicit join key |
| FLOAT | Inner, outer, cross and semi joins |
| DOUBLE | Sub-queries in FROM clause |
| STRING | ROLLUP and CUBE |
| TIMESTAMP | UNION |
| BINARY | Windowing Functions (OVER, RANK, etc.) |
| DECIMAL | Custom Java UDFs |
| ARRAY, MAP, STRUCT, UNION | Standard Aggregation (SUM, AVG, etc.) |
| DATE | Advanced UDFs (ngram, Xpath, URL) |
| VARCHAR | Sub-queries in WHERE, HAVING |
| CHAR | Expanded JOIN Syntax |
| | SQL Compliant Security (GRANT, etc.) |
| | INSERT/UPDATE/DELETE (ACID) |

Legend (color coding on the original slide): Hive 0.12, Available, Roadmap

SQL Compliance: Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
![Page 12: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/12.jpg)
Example Benchmark Spec
• The TPC-DS benchmark data+query set
• Query 27 – big table (store_sales) joins lots of small tables
  – A.K.A. Star Schema Join
• What does Query 27 do?
  – For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price and average coupon amount for a given gender, marital status, education and customer demographic.
![Page 13: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/13.jpg)
SELECT col5, avg(col6)
FROM store_sales_fact ssf
join item_dim on (ssf.col1 = item_dim.col1)
join date_dim on (ssf.col2 = date_dim.col2)
join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3)
join store_dim on (ssf.col4 = store_dim.col4)
GROUP BY col5
ORDER BY col5
LIMIT 100;
Query 27 - Star Schema Join
• Derived from TPC-DS Query 27
Table sizes shown on the slide: 41 GB (the large fact table) and 58 MB, 11 MB, 80 MB, 106 KB (the four dimension tables)
![Page 14: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/14.jpg)
New Query Planner
![Page 15: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/15.jpg)
Query 27 Execution Before Hive 11 – Text Format
The query spawned 5 MR jobs
The intermediate output of each job is written to HDFS
Query Response Time
179 mappers were executed in total
![Page 16: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/16.jpg)
Query 27 Execution With Hive 11 – Text Format
The query spawned 1 job with Hive 11, compared to 5 MR jobs with Hive 10
Job 1 of 1 – Each mapper loads the 4 small dimension tables into memory and streams parts of the large fact table. The joins then occur in the mapper, hence the name MapJoin
Performance increased with Hive 11: query time went down from 21 minutes to about 4 minutes
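The MapJoin idea described on this slide can be sketched in a few lines of Python: the small dimension tables are held in memory as hash maps, and each fact row is joined as it streams through the mapper, so no reduce phase is needed. The table contents and names below are hypothetical, and this is only a model of the technique, not Hive's implementation.

```python
# Hypothetical small dimension tables, keyed by their join column.
ITEM_DIM = {1: "book", 2: "lamp"}
STORE_DIM = {10: "Paris", 20: "London"}

# Hypothetical slice of the large fact table: (item_id, store_id, quantity).
FACT_SPLIT = [(1, 10, 3), (2, 20, 1), (1, 20, 5), (3, 10, 2)]

def map_join(fact_rows, item_dim, store_dim):
    """Inner-join streamed fact rows against in-memory dimension tables."""
    for item_id, store_id, quantity in fact_rows:
        item = item_dim.get(item_id)
        store = store_dim.get(store_id)
        if item is None or store is None:
            continue  # no match in a dimension table: the row is dropped
        yield item, store, quantity

joined = list(map_join(FACT_SPLIT, ITEM_DIM, STORE_DIM))
print(joined)
```

The approach only works while the dimension tables fit in each mapper's memory, which is the case for the star schema used in this benchmark.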
![Page 17: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/17.jpg)
Query 27 Execution With Hive 11 – RC Format
Converting from Text to RC file format decreased the size of the dimension data set from 38 GB to 8.21 GB
A smaller file means less I/O, so the query time decreased from 246 seconds to 136 seconds
![Page 18: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/18.jpg)
Query 27 Execution With Hive 11 – ORC Format
The ORC file type packs data more tightly than RCFile: the size of the data set decreased from 8.21 GB to 2.83 GB
A smaller file means less I/O, so the query time decreased from 136 seconds to 104 seconds
![Page 19: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/19.jpg)
Summary of Results
| File Type | Number of MR Jobs | Input Size | Mappers | Time |
|---|---|---|---|---|
| Text/Hive 10 | 5 | 43.1 GB | 179 | 1260 seconds |
| Text/Hive 11 | 1 | 38 GB | 151 | 246 seconds |
| RC/Hive 11 | 1 | 8.21 GB | 76 | 136 seconds |
| ORC/Hive 11 | 1 | 2.83 GB | 38 | 104 seconds |
| RC/Hive 11/Partitioned/Bucketed | 1 | 1.73 GB | 19 | 104 seconds |
| ORC/Hive 11/Partitioned/Bucketed | 1 | 687 MB | 27 | 79.62 seconds |
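The relative gains in the summary of results can be worked out directly from the reported numbers. The short Python sketch below computes, for each configuration, the speedup and the input-size reduction relative to Text/Hive 10; it assumes the last time (79.62) is in seconds and treats 687 MB as 0.687 GB.

```python
# (input size in GB, query time in seconds), copied from the summary of results.
# Assumption: 687 MB is written as 0.687 GB (decimal units).
RESULTS = {
    "Text/Hive 10": (43.1, 1260),
    "Text/Hive 11": (38, 246),
    "RC/Hive 11": (8.21, 136),
    "ORC/Hive 11": (2.83, 104),
    "RC/Hive 11/Partitioned/Bucketed": (1.73, 104),
    "ORC/Hive 11/Partitioned/Bucketed": (0.687, 79.62),
}

BASELINE = "Text/Hive 10"

def speedup(name):
    """Baseline query time divided by this configuration's query time."""
    return RESULTS[BASELINE][1] / RESULTS[name][1]

def input_reduction(name):
    """Baseline input size divided by this configuration's input size."""
    return RESULTS[BASELINE][0] / RESULTS[name][0]

for name in RESULTS:
    print(f"{name}: {speedup(name):.1f}x faster, {input_reduction(name):.1f}x less input")
```

On these figures, the new query planner alone accounts for roughly a 5x gain, and the combination of ORC, partitioning and bucketing brings the total to roughly 16x on this test.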
![Page 20: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/20.jpg)
ORC file format
Optimized RC File
![Page 21: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/21.jpg)
ORCFile - Optimized Column Storage
• Make a better columnar storage file
  – Evolve based on Google Dremel format
• Decompose complex row types into primitive fields
  – Better compression and projection
• Only read bytes from HDFS for the required columns
• Store column level aggregates in the files
  – Only need to read the file meta information for common queries
  – Stored both for file and each section of a file
  – Aggregates: min, max, sum, average, count
  – Allows fast access by sorted columns
• Ability to add bloom filters for columns
  – Enables quick checks for whether a value is present
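The benefit of storing column-level aggregates per section of a file can be illustrated with a small Python model: if a section's stored min and max exclude the value being searched for, the reader can skip that section without reading its data. The `Stripe` class and the sample values are hypothetical and only model the idea; they do not reproduce the actual ORC layout.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Stripe:
    """A section of a file with min/max statistics computed at write time."""
    values: List[int]
    minimum: int = field(init=False)
    maximum: int = field(init=False)

    def __post_init__(self):
        self.minimum = min(self.values)
        self.maximum = max(self.values)

def scan_equal(stripes, target):
    """Return matching values and how many stripes actually had to be read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if target < stripe.minimum or target > stripe.maximum:
            continue  # statistics prove the value cannot be here: skip the read
        stripes_read += 1
        matches.extend(v for v in stripe.values if v == target)
    return matches, stripes_read

STRIPES = [Stripe([1, 5, 9]), Stripe([10, 12, 15]), Stripe([20, 25, 30])]
print(scan_equal(STRIPES, 12))
```

Skipping works best when the data is sorted or clustered on the filtered column, which matches the "fast access by sorted columns" point above.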
![Page 22: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/22.jpg)
ORCFile - File Layout
![Page 23: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/23.jpg)
Interactive Query at Scale
Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale
Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster

File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500 Dataset)

| Encoded with | File size |
|---|---|
| Text | 585 GB (Original Size) |
| RCFile | 505 GB (14% Smaller) |
| Parquet (Impala) | 221 GB (62% Smaller) |
| ORCFile (Hive 12) | 131 GB (78% Smaller) |

• Larger Block Sizes
• Columnar format arranges columns adjacent within the file for compression & fast access
![Page 24: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/24.jpg)
Apache Tez
A New Hadoop Data Processing Framework
![Page 25: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/25.jpg)
Moving Hadoop Beyond MapReduce
• Low level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• New base of MapReduce, Hive, Pig, Cascading etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
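The difference between a chain of separate jobs and a single pipelined execution can be modeled very simply. In the Python sketch below, the same three hypothetical stages are run either as separate jobs that each materialize their output (standing in for an HDFS write), or as one chain where only the final result is written. This is a toy model of the idea, not the Tez API.

```python
# Three hypothetical processing stages; each takes and returns a list of rows.
STAGES = [
    lambda rows: [r for r in rows if r % 2 == 0],  # filter
    lambda rows: [r * 10 for r in rows],           # transform
    lambda rows: [sum(rows)],                      # aggregate
]

def run_as_separate_jobs(stages, rows):
    """Model of chained MapReduce jobs: every job writes its output to HDFS."""
    hdfs_writes = []
    for stage in stages:
        rows = stage(rows)
        hdfs_writes.append(list(rows))  # materialized between jobs
    return rows, len(hdfs_writes)

def run_as_single_dag(stages, rows):
    """Model of one pipelined DAG: only the final result is written."""
    for stage in stages:
        rows = stage(rows)  # passed directly to the next stage
    hdfs_writes = [list(rows)]
    return rows, len(hdfs_writes)

print(run_as_separate_jobs(STAGES, [1, 2, 3, 4]))
print(run_as_single_dag(STAGES, [1, 2, 3, 4]))
```

Both runs produce the same result; the pipelined version simply avoids the intermediate writes, which is where the lighter disk and network usage comes from.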
![Page 26: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/26.jpg)
FastQuery: Beyond Batch with YARN
Tez Generalizes Map-Reduce: Simplified execution plans process data more efficiently
Always-On Tez Service: Low latency processing for all Hadoop data processing
![Page 27: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/27.jpg)
Apache Tez as the new Primitive
HADOOP 1.0 (MapReduce as Base), from top to bottom:
• Pig (data flow), Hive (sql), Others (cascading)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0 (Apache Tez as Base), from top to bottom:
• Batch (MapReduce), Data Flow (Pig), SQL (Hive), Others (cascading), Real Time Stream Processing (Storm), Online Data Processing (HBase, Accumulo)
• Tez (execution engine)
• YARN (cluster resource management)
• HDFS2 (redundant, reliable storage)
![Page 28: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/28.jpg)
Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a
UNION
SELECT x, AVERAGE(y) AS AVG
FROM c
GROUP BY x
ORDER BY AVG;
[Figure: execution plans side by side, Hive – MR and Hive – Tez. Stage labels: SELECT a.state; SELECT c.price (SELECT a.state, c.itemId on the Tez side); JOIN (a, c); SELECT b.id; JOIN (a, b), GROUP BY a.state, COUNT(*), AVERAGE(c.price). In the MR plan, map (M) and reduce (R) stages are separated by writes to HDFS; in the Tez plan, the stages are connected directly.]
Tez avoids unneeded writes to HDFS
![Page 29: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/29.jpg)
Speed: Interactive Query In Hadoop
| Query | Hive 10 | Hive 0.11 (Phase 1) | Trunk (Phase 3) | Improvement |
|---|---|---|---|---|
| TPC-DS Query 27 | 1400s | 39s | 7.2s | 190x |
| TPC-DS Query 82 | 3200s | 65s | 14.9s | 200x |

Query 27: Pricing Analytics using Star Schema Join
Query 82: Inventory Analytics Joining 2 Large Fact Tables
All Results at Scale Factor 200 (Approximately 200GB Data)
Test Cluster:
• 200 GB Data (ORCFile)
• 20 Nodes, 24GB RAM each, 6x disk each
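The headline improvement factors can be checked against the reported timings. The Python sketch below assumes that the three values given for each query correspond to Hive 10, Hive 0.11 and trunk, in that order, and divides the slowest time by the fastest.

```python
# Query times in seconds as shown on the slide.
# Assumption: the order is (Hive 10, Hive 0.11, trunk).
TIMINGS = {
    "TPC-DS Query 27": (1400, 39, 7.2),
    "TPC-DS Query 82": (3200, 65, 14.9),
}

def improvement(times):
    """Ratio between the first (oldest) and last (newest) measurement."""
    return times[0] / times[-1]

for query, times in TIMINGS.items():
    print(f"{query}: {improvement(times):.0f}x")
```

The computed ratios (about 194x and 215x) are consistent with the rounded 190x and 200x figures on the slide.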
![Page 30: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/30.jpg)
There is NO second place
Hortonworks…the Bull Elephant of Hadoop Innovation
![Page 31: Hive on steroid](https://reader035.vdocument.in/reader035/viewer/2022062316/56816934550346895de08bb4/html5/thumbnails/31.jpg)
Thank You
hortonworks.com
hortonworks.com/sandbox