An Introduction to Hive
An Introduction to Apache HIVE
Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi
Agenda
• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not
• Brief About Data Warehouse
Agenda- Contd.
• Hive Architecture
– Components
– Architecture Diagram
• Hive in Production
– HQL
– Data Insertion/Aggregation
• Performance
• Further Reading
• References
Starting Note
• What is Apache Hive?
– Open source (very important!), so it is free
– A data warehouse system on Hadoop
– Provides HQL (an SQL-like query interface)
– Suitable for structured and semi-structured data
– Able to work with different storage systems and file formats
Starting Note- Contd.
• What is cool about Hive
– Lets users run MapReduce (MR) jobs through the HiveQL interface without thinking in MR terms.
• Some history
– Hive was created at Facebook.
– Netflix also develops it.
– Amazon uses it in Amazon Elastic MapReduce.
Starting Note- Contd.
• What Hive is not
– It does not use complex indexes, so it does not respond within seconds.
– But it scales very well and works with data on the order of petabytes.
– It is not independent: its performance is tied to Hadoop.
Brief About Data Warehouse
• OLAP vs OLTP
– A DW is needed for OLAP.
– We want reports and summaries, not the live transaction data needed to keep operations running.
– We need reports to improve operations, not to conduct an operation.
– We use ETL to populate data in the DW.
Brief About Data Warehouse
Inmon approach vs Kimball approach
Brief About Data Warehouse
• Other keywords
– ODS (Operational Data Store)
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs
Hive Architecture
• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server
Hive Architecture
[Architecture diagram. Labels shown: Web UI, Hive CLI and JDBC/ODBC clients (browse, query, DDL); Thrift API; MetaStore; Hive QL (parser, planner, optimizer, execution); SerDe (CSV, Thrift, Regex); UDF/UDAF (substr, sum, average); file formats (TextFile, SequenceFile, RCFile); user-defined map-reduce scripts; Map Reduce; HDFS.]
Hive Architecture- Contd.
– Internal Components
• Compiler and Planner
– Compiles and checks the input query and creates an execution plan.
• Optimizer
– Optimizes the execution plan before it runs.
• Execution Engine
– Runs the execution plan. The execution plan is guaranteed to be a DAG (directed acyclic graph).
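One way to look at what the compiler and optimizer produce is Hive's EXPLAIN statement. The sketch below assumes the movies table from the HQL samples in this deck already exists; the grouping query itself is only an illustration.

```sql
-- Print the execution plan instead of running the query.
-- Assumes the movies table (movie_id, movie_name, tags) has been created.
EXPLAIN
SELECT tags, COUNT(*) AS movie_count
FROM movies
GROUP BY tags;
```

The output lists the stages of the plan and the dependencies between them. The exact text depends on the Hive version and execution engine.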
Hive Architecture- Contd.
• Hive Data Model
– Any data in Hive is categorized into:
• Databases
– First level of abstraction.
• Tables
– Ordinary tables.
• Partitions
– To handle data transfer in MR.
• Buckets
– Facilitate data access within partitions.
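As a minimal sketch of how partitions and buckets appear in a table definition, consider the following. The page_views table and its columns are hypothetical names chosen for illustration, and the bucket count of 32 is arbitrary.

```sql
-- Hypothetical table: page views partitioned by day and bucketed by user.
CREATE TABLE page_views (
  user_id INT,
  url STRING
)
PARTITIONED BY (dt STRING)               -- each dt value is stored in its own directory
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- rows are hashed on user_id into 32 files per partition

-- A filter on the partition column lets Hive read only the matching directory.
SELECT COUNT(*) FROM page_views WHERE dt = '2013-10-01';
```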
Hive in Production
• Log processing
– Daily Report
– User Activity Measurement
• Data/Text mining
– Machine learning (Training Data)
• Business intelligence
– Advertising Delivery
– Spam Detection
Hive in Production
– HQL
• Create
• Row Format
• SerDe
• Select
• Cluster By/Distribute By
– Data Insertion/Aggregation
HQL- Samples
• CREATE TABLE
CREATE TABLE movies (movie_id int, movie_name string, tags string)
• ROW FORMAT
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
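Put together, the two fragments above form one statement. The sketch below also loads a file into the table. The path /tmp/movies.txt and the sample line are assumptions for illustration only.

```sql
-- Full statement: column list followed by the row format clause.
CREATE TABLE movies (movie_id INT, movie_name STRING, tags STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

-- Hypothetical local file; each line is expected to look like
--   1:Toy Story:animation
LOAD DATA LOCAL INPATH '/tmp/movies.txt' INTO TABLE movies;
```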
HQL- Samples
• Partition
create table table_name (id int, name string)
partitioned by (date string)
– The partition column is declared only in the partitioned by clause, not in the regular column list.
HQL- Samples
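• SerDe
– User table with “id::gender::age::occupation::zipcode” format.
CREATE TABLE USER (id STRING, gender STRING, age STRING, occupation STRING, zipcode STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
– The contrib RegexSerDe accepts only STRING columns, so numeric fields such as id and age are declared as STRING and cast when queried.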
HQL- Samples
• Select
SELECT * FROM movies LIMIT 10;
• Distribute By
– SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize rows when they are sent to the reducers.
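The relationship between DISTRIBUTE BY, SORT BY and CLUSTER BY (listed earlier under HQL) can be sketched on the same movies table. The choice of sort column is only an illustration.

```sql
-- Rows with the same tags value go to the same reducer;
-- the order of rows within a reducer is not specified.
SELECT movie_id, movie_name, tags
FROM movies
DISTRIBUTE BY tags;

-- Same distribution, and each reducer also sorts its rows by movie_name.
SELECT movie_id, movie_name, tags
FROM movies
DISTRIBUTE BY tags
SORT BY movie_name;

-- CLUSTER BY tags is shorthand for DISTRIBUTE BY tags SORT BY tags.
SELECT movie_id, movie_name, tags
FROM movies
CLUSTER BY tags;
```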
Hive Process
• Data Insertion/Aggregation
– Bulk
• ETL
– Talend (community version)
– Sqoop (SQL to Hadoop, Apache license)
– SyncSort (not free)
Hive Process- Contd.
– STP (Straight Through Processing)
• Flume: Apache licensed
• Chukwa: a part of the Apache Hadoop distribution
• Scribe: Facebook's solution for log processing and aggregation
Hive Process- Contd.
• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count errors per session
– Count streams per day
– Ad-hoc queries such as summaries (sum, max, min, …)
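The kinds of queries listed above might look like the following sketch. The stream_logs table, its columns and the event_type values are hypothetical and are not taken from Netflix's actual schema.

```sql
-- Hypothetical log table, for illustration only.
CREATE TABLE stream_logs (
  session_id STRING,
  event_type STRING,   -- assumed values such as 'error' or 'stream_start'
  bitrate INT
)
PARTITIONED BY (dt STRING);

-- Count errors per session.
SELECT session_id, COUNT(*) AS error_count
FROM stream_logs
WHERE event_type = 'error'
GROUP BY session_id;

-- Count streams per day.
SELECT dt, COUNT(*) AS stream_count
FROM stream_logs
WHERE event_type = 'stream_start'
GROUP BY dt;

-- Ad-hoc summary per day (sum, max, min).
SELECT dt, SUM(bitrate), MAX(bitrate), MIN(bitrate)
FROM stream_logs
GROUP BY dt;
```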
Hive Process- Contd.
• Phase 1
– A Hadoop job parses the logs and loads them into Hive every hour.
– The previous job should also run every 24 hours for a summary.
• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa provides non-stop log collection.
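On the Hive side, the hourly load and daily summary of Phase 1 could be sketched as below. The table names, columns and HDFS path are assumptions made for illustration. The slides do not show the actual job.

```sql
-- Hypothetical table for parsed logs, partitioned by day and hour.
CREATE TABLE parsed_logs (session_id STRING, message STRING)
PARTITIONED BY (dt STRING, hr STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Hourly step: move one hour of parsed output (assumed HDFS path) into its partition.
LOAD DATA INPATH '/logs/parsed/2013-10-01/13'
INTO TABLE parsed_logs PARTITION (dt = '2013-10-01', hr = '13');

-- Hypothetical summary table for the daily step.
CREATE TABLE daily_summary (session_id STRING, message_count BIGINT)
PARTITIONED BY (dt STRING);

-- Daily step: aggregate the hourly partitions of one day.
INSERT OVERWRITE TABLE daily_summary PARTITION (dt = '2013-10-01')
SELECT session_id, COUNT(*)
FROM parsed_logs
WHERE dt = '2013-10-01'
GROUP BY session_id;
```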
Performance
• According to Globant's investigations
• Tables:
Further Reading
• Apache Drill
– Software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
– Platform for creating and running MR programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
References
• https://www.facebook.com/note.php?note_id=89508453919
• https://github.com/facebook/scribe
• http://sqoop.apache.org/docs/
• http://flume.apache.org/FlumeDeveloperGuide.html
• Sqoop Database Import For Hadoop, Cloudera, Oct. 2009
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• http://www.semantikoz.com/blog/the-free-apache-hive-book/
• Beginning Microsoft® SQL Server® 2012 Programming, Wiley, Paul Atkinson and Robert Vieira, ISBN: 978-1-118-10228-2
• Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009
Thanks…