jive with hive - guss.proguss.pro/.../uploads/2013/10/sqlsaturday-paris-2013-jive-with-hive.… ·...
![Page 1: Jive with Hive - guss.proguss.pro/.../uploads/2013/10/SQLSaturday-Paris-2013-Jive-with-Hive.… · Jive with Hive . Allan Mitchell •Joint author on 2005/2008 SSIS Book by Wrox •Websites](https://reader031.vdocument.in/reader031/viewer/2022041017/5ec9ca4d346ec16a4a0d23de/html5/thumbnails/1.jpg)
Jive with Hive
Allan Mitchell
• Joint author on 2005/2008 SSIS Book by Wrox
• Websites
– www.CopperBlueConsulting.com
• Specialise in Data and Process Integration
• Microsoft SQL Server MVP
• Twitter: allanSQLIS
Agenda
Hive solves the business problem of analyzing large amounts of data
• A brief summary of Hadoop and Big Data
• What is the purpose of Hive?
• Why Hive?
• A history of Hive
• What are Hive’s constituents?
What is Big Data
• Traditionally:
– Physics experiments, sensor data, satellite data, …
• Now:
– Operational logs
– Customer behavior
– Social interactions online
– …
• From terabytes in the 1990s, through petabytes today, to zettabytes in the future
What is Big Data
“When you have to innovate to collect, store, organize, analyse and share it” (Werner Vogels, Amazon CTO)
What is Hadoop?
“Flexible and Available Architecture for Large Scale computation and data processing on a network of highly available commodity hardware.”
HDInsight Ecosystem
[Diagram: Distributed Storage (HDFS), Distributed Processing (MapReduce), ODBC, Azure Data Marketplace, Windows Azure Storage]
HDInsight
• Hadoop
• Collaboration with Hortonworks
• Sandbox Download – Single node cluster
• Azure offering
• HDP 1.3 for Windows – multi-node cluster
Hadoop’s Lineage
* Resource: Kerberos Konference (Yahoo) – 2010
Hadoop Key Terms
What is the purpose of Hive?
Hive is a solution to a business problem: How do you analyze large amounts of data?
Data Scientists want to study data
– Communicate with the data
Businesses want to reap benefits of data
– Results that make sense of the data
What is the purpose of Hive?
Hive is a data warehousing system for Hadoop
– To meet the needs of businesses, data scientists, analysts and BI professionals
Data, Summarized
– Fit a structure onto data
Data, Analyzed
– Analysis of large datasets stored in Hadoop file systems
– SQL-like language called HiveQL
– Custom mappers and reducers when HiveQL isn’t enough
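One way Hive supports custom mappers and reducers is by streaming rows through an external script with TRANSFORM. The following is a minimal sketch, not taken from the talk: the script name `my_mapper.py` and its path are hypothetical, and the `mytable` columns are assumed to match the examples later in the deck.

```sql
-- Ship the (hypothetical) script to the cluster nodes.
ADD FILE /tmp/my_mapper.py;

-- Stream each row through the script. Hive writes the selected columns to the
-- script's stdin as tab-separated text and parses its tab-separated stdout
-- into the columns named in AS (...).
SELECT TRANSFORM (stock_symbol, dt)
       USING 'python my_mapper.py'
       AS (stock_symbol STRING, dt_year STRING)
FROM mytable;
```

The script itself can be written in any language available on the nodes, as long as it reads lines from stdin and writes lines to stdout.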
Why Hive?
Can’t Hadoop be used to solve these problems? Why is there a need for Hive?
• Writing MR jobs in Java can be difficult
• You don’t know it’s wrong until it’s fallen over!
• Joining large datasets can be difficult
• Learning curve
• Ordering datasets requires being a ninja
Hive History
What can Hive offer you?
Hive can help with a range of business problems:
• Log Processing
• Predictive Modelling
• Hypothesis testing
• And Business Intelligence
Hive is not a replacement for SQL
So don’t throw out your SQL Server instances!
• Hive is for processing large data sets that may span hundreds, or even thousands, of machines
• Hive has a high overhead for starting a job: it translates queries to MapReduce (MR) jobs, so even small queries take time
• Unlike SQL Server, Hive does not cache data
• Hive performance tuning is mainly Hadoop performance tuning
• Similar query language, but different architectures for different purposes
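The translation step can be inspected with EXPLAIN, which prints the plan Hive would run instead of executing the query. A small sketch, assuming a table `mytable` with a `stock_symbol` column as in the later examples:

```sql
-- Show the execution plan (stages and operators) without running the query.
EXPLAIN
SELECT stock_symbol, COUNT(*)
FROM mytable
GROUP BY stock_symbol;
```

On a MapReduce-backed Hive installation the plan lists the stages the query is compiled into, which is a useful first look when a query is slower than expected.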
Agenda
• What are Hive’s constituents?
– Hive as a SQL-like Language Query Tool
– Hive as a Translation Tool
– Hive as a Structuring Tool
HiveQL
HiveQL is a SQL-like language
– It outputs naturally occurring groups for further analysis
Easy Data Summarization
– Large datasets, summarized
– Fit a structure onto data
• Analysis of large datasets stored in Hadoop file systems
• SQL-like language called HiveQL
• Custom mappers and reducers when HiveQL isn’t enough
HiveQL Queries like SQL Queries?
Similarities in Syntax and Features
Similar features
• SELECT
• FROM
• WHERE
• GROUP BY / HAVING
• Table Aliases
• Computed Columns
• Aggregate Functions
• Nested Select
• CASE
• LIKE / RLIKE
• JOIN
• ORDER BY / SORT BY
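ORDER BY and SORT BY look alike but behave differently on a cluster. The sketch below assumes a table `mytable` with `stock_symbol`, `dt` and `val` columns (the `val` column is an assumption for illustration).

```sql
-- ORDER BY: a single total ordering of the result.
-- All rows pass through one reducer, which can be slow on large outputs.
SELECT stock_symbol, dt, val
FROM mytable
ORDER BY val DESC;

-- SORT BY: rows are sorted within each reducer only, so the combined output
-- is not globally ordered when more than one reducer runs.
SELECT stock_symbol, dt, val
FROM mytable
SORT BY val DESC;
```

SORT BY is the cheaper choice when a per-reducer ordering is enough, for example when the output feeds another job.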
How does Hive work?
Hive as a structuring tool
– Creates a schema around the data
– Tables stored in directories
Hive Tables
– Rows and columns, like SQL tables
Hive Metastore
– Namespace with a set of tables
– Holds table definitions
– Physical Layout
– Column Types
– Partition Information
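What the metastore holds can be inspected from the Hive prompt. A short sketch, assuming a table named `mytable` exists; the last statement additionally assumes the table is partitioned.

```sql
-- List the tables registered in the current database (namespace).
SHOW TABLES;

-- Show column names and types, the storage location and the table type
-- as recorded in the metastore.
DESCRIBE FORMATTED mytable;

-- List the partitions the metastore knows about
-- (only valid if mytable was created with PARTITIONED BY).
SHOW PARTITIONS mytable;
```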
Hive DEMO
Hive – Create a Table v2
-- asv:// addresses Windows Azure Blob storage from HDInsight
CREATE EXTERNAL TABLE Ext (
  Exch string,
  Symbol string,
  date string,
  val float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'asv:///inputfiles/';
Hive – Create a Table v3
INSERT OVERWRITE TABLE SomeTable
SELECT pv_users.gender,
       count(DISTINCT pv_users.userid),
       count(*),
       sum(DISTINCT pv_users.userid)
FROM ADifferentTable pv_users   -- alias so the pv_users.* references resolve
GROUP BY pv_users.gender;
Hive – Create a Table v4
CREATE TABLE SomeTable LIKE ADifferentTable;
Hive – Create a Table v5
CREATE TABLE SomeTable AS
SELECT * FROM AnotherTable;
Notes about creating tables
• INTERNAL
– Hive manages both the data and the metadata
– Drop the table and the data goes too
– Useful for temporary objects
– Cannot specify INTERNAL on the CREATE statement
– Default
• EXTERNAL
– Hive manages only the metadata
– DROP removes just the metadata, not the data
– Must use the EXTERNAL keyword
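The difference shows up when a table is dropped. A minimal sketch, with hypothetical table names and a hypothetical `/data/quotes/` location:

```sql
-- Managed (internal) table: no keyword needed, it is the default.
CREATE TABLE managed_quotes (symbol STRING, val FLOAT);

-- External table: Hive records only the metadata; the files stay at LOCATION.
CREATE EXTERNAL TABLE external_quotes (symbol STRING, val FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/quotes/';    -- hypothetical path

DROP TABLE managed_quotes;   -- removes the metadata and the data files
DROP TABLE external_quotes;  -- removes the metadata only; files under /data/quotes/ remain
```

External tables suit data that other tools also read, since dropping the Hive definition leaves the files in place.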
Hive – Joining Tables
SELECT a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2);

SELECT a.val, b.val
FROM a LEFT OUTER JOIN b ON (a.key = b.key)
WHERE a.ds = '2009-07-07' AND b.ds = '2009-07-07';

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key);
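In the LEFT OUTER JOIN example, the WHERE clause is evaluated after the join, so rows of `a` with no match in `b` (where `b.ds` is NULL) are filtered out and the query behaves like an inner join. If the unmatched rows should be kept, one option is to move the filter on `b` into the ON clause. A sketch using the same table and column names:

```sql
-- Filter b inside the join condition so unmatched rows of a survive
-- with NULL for b.val; the filter on a stays in WHERE.
SELECT a.val, b.val
FROM a LEFT OUTER JOIN b
  ON (a.key = b.key AND b.ds = '2009-07-07')
WHERE a.ds = '2009-07-07';
```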
Hive – Sampling
SELECT * FROM source TABLESAMPLE(0.1 PERCENT);
SELECT * FROM source LIMIT 10;
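LIMIT returns whichever rows Hive reads first, which is not a random sample, and percentage sampling works at the level of input blocks. When a row-level random sample is wanted, bucket sampling on `rand()` is another option. A sketch against the same `source` table; the bucket count of 32 is arbitrary:

```sql
-- Keep roughly 1 in 32 rows, chosen by hashing a random value per row.
-- This still scans the whole table.
SELECT *
FROM source TABLESAMPLE(BUCKET 1 OUT OF 32 ON rand()) s;
```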
![Page 36: Jive with Hive - guss.proguss.pro/.../uploads/2013/10/SQLSaturday-Paris-2013-Jive-with-Hive.… · Jive with Hive . Allan Mitchell •Joint author on 2005/2008 SSIS Book by Wrox •Websites](https://reader031.vdocument.in/reader031/viewer/2022041017/5ec9ca4d346ec16a4a0d23de/html5/thumbnails/36.jpg)
Hive – Group By
SELECT stock_symbol, dt, COUNT(*) FROM mytable GROUP BY stock_symbol, dt;
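Groups can be filtered after aggregation with HAVING, as in SQL. A short sketch on the same `mytable`; the threshold of 100 is arbitrary:

```sql
-- Keep only symbols with more than 100 rows.
SELECT stock_symbol, COUNT(*) AS row_count
FROM mytable
GROUP BY stock_symbol
HAVING COUNT(*) > 100;
```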
![Page 37: Jive with Hive - guss.proguss.pro/.../uploads/2013/10/SQLSaturday-Paris-2013-Jive-with-Hive.… · Jive with Hive . Allan Mitchell •Joint author on 2005/2008 SSIS Book by Wrox •Websites](https://reader031.vdocument.in/reader031/viewer/2022041017/5ec9ca4d346ec16a4a0d23de/html5/thumbnails/37.jpg)
Hive – Filter
SELECT * FROM mytable WHERE stock_symbol = 'NAC';
-- == is accepted as a synonym for =
SELECT * FROM mytable WHERE stock_symbol == 'NAC';
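Pattern filters use LIKE and RLIKE, listed earlier among the features HiveQL shares with SQL. A sketch on the same `mytable`; the patterns are illustrative only:

```sql
-- LIKE uses SQL wildcards: % matches any run of characters.
SELECT * FROM mytable WHERE stock_symbol LIKE 'NA%';

-- RLIKE uses a Java regular expression:
-- symbols starting with N followed by A or B.
SELECT * FROM mytable WHERE stock_symbol RLIKE '^N[AB]';
```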