Hive: A data warehouse on Hadoop, based on Facebook Team's paper


Upload: frank-powell

Post on 24-Dec-2015



Page 1: Hive: A data warehouse on Hadoop, based on Facebook Team's paper (8/18/2015)

Hive: A data warehouse on Hadoop

Based on Facebook Team’s paper


Page 2:

Motivation

• Yahoo worked on Pig to facilitate application deployment on Hadoop.
  – Their need was mainly focused on unstructured data.

• Simultaneously, Facebook started working on deploying warehouse solutions on Hadoop, which resulted in Hive.
  – The size of the data being collected and analyzed in industry for business intelligence (BI) is growing rapidly, making traditional warehousing solutions prohibitively expensive.
  – Hive attempts to leverage the extensive SQL-skilled workforce to do big data analytics.


Page 3:

Hadoop MR

• MR is very low level and requires customers to write custom programs.
• Hive supports queries expressed in a SQL-like language called HiveQL, which are compiled into MR jobs that are executed on Hadoop.
• Hive also allows custom MR scripts.
• It also includes a MetaStore that contains schemas and statistics, which are useful for data exploration, query optimization, and query compilation.
• At Facebook, the Hive warehouse contains tens of thousands of tables, stores over 700 TB, and is used for reporting and ad-hoc analyses by 200 Facebook users (numbers at the time the paper was written).


Page 4:

Data model

• Hive structures data into well-understood database concepts such as tables, rows, columns, and partitions.
• It supports primitive types: integers, floats, doubles, and strings.
• Hive also supports:
  – associative arrays: map<key-type, value-type>
  – lists: list<element-type>
  – structs: struct<field-name: field-type, ...>
• SerDe: a serialize/deserialize API used to move data in and out of tables.
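A table combining these types might be declared as follows (a sketch; the table and column names are hypothetical, and note that in HiveQL the list type from the paper is written ARRAY):

```sql
-- Hypothetical table using Hive's complex types.
CREATE TABLE user_activity (
  userid     BIGINT,
  name       STRING,
  properties MAP<STRING, STRING>,            -- associative array
  pages      ARRAY<STRING>,                  -- list
  address    STRUCT<city: STRING, zip: INT>  -- struct
);
```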


Page 5:

Query Language (HiveQL)

• Subset of SQL
• Metadata queries
• Limited equality and join predicates
• No inserts into existing tables (to preserve the write-once, WORM property)
  – Can overwrite an entire table
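The "overwrite an entire table" path looks like this in HiveQL (a sketch; table and column names are hypothetical):

```sql
-- Appending to an existing table is not allowed, but the whole
-- table can be replaced atomically with INSERT OVERWRITE:
INSERT OVERWRITE TABLE daily_summary
SELECT ds, COUNT(*)
FROM page_views
GROUP BY ds;
```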


Page 6:

Wordcount in Hive

FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
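The paper does not show wc_mapper.py or wc_reduce.py; a minimal sketch of the logic they would implement (assuming whitespace tokenization) is:

```python
# Hypothetical sketch of the wc_mapper.py / wc_reduce.py logic.
from itertools import groupby

def wc_map(lines):
    """Mapper: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

def wc_reduce(pairs):
    """Reducer: sum counts per word; rows must arrive grouped by word,
    which is what CLUSTER BY word guarantees."""
    for word, group in groupby(pairs, key=lambda p: p[0]):
        yield word, sum(cnt for _, cnt in group)

docs = ["hive on hadoop", "hadoop map reduce"]
counts = dict(wc_reduce(sorted(wc_map(docs))))  # sorted() stands in for CLUSTER BY
print(counts)  # → {'hadoop': 2, 'hive': 1, 'map': 1, 'on': 1, 'reduce': 1}
```

In the real job, each script reads tab-separated rows on stdin and writes them to stdout; Hadoop's shuffle, not `sorted()`, does the clustering.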


Page 7:

Session/timestamp example

FROM (
  FROM session_table
  SELECT sessionid, tstamp, data
  DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
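The key semantics here are DISTRIBUTE BY sessionid SORT BY tstamp: all rows for a session reach the same reducer, ordered by timestamp. The paper does not show session_reducer.sh, so as a hypothetical stand-in, this sketch computes each session's duration:

```python
# Sketch of DISTRIBUTE BY sessionid SORT BY tstamp semantics, with a
# hypothetical reducer: per-session duration (last - first timestamp).
from itertools import groupby

def session_durations(rows):
    """rows: iterable of (sessionid, tstamp, data) tuples."""
    # One sort stands in for distribute-then-sort on the cluster.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    out = {}
    for sid, group in groupby(ordered, key=lambda r: r[0]):
        ts = [t for _, t, _ in group]
        out[sid] = ts[-1] - ts[0]
    return out

rows = [("s1", 30, "c"), ("s2", 10, "x"), ("s1", 10, "a"), ("s1", 20, "b")]
print(session_durations(rows))  # → {'s1': 20, 's2': 0}
```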


Page 8:

Data Storage

• Tables are logical data units; table metadata associates the data in a table with HDFS directories.
• HDFS namespace mapping: table → HDFS directory, partition → HDFS subdirectory, bucket → file within the partition's directory.
• /user/hive/warehouse/test_table is an HDFS directory.
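A sketch of how a table declaration maps to that layout (the partition and bucket columns are hypothetical):

```sql
CREATE TABLE test_table (userid BIGINT, page STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
-- Loading partition ds='2009-01-01' then produces HDFS paths such as
--   /user/hive/warehouse/test_table/ds=2009-01-01/<bucket files>
```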


Page 9:

Hive architecture (from the paper)


Page 10:

Architecture

• Metastore: stores the system catalog.
• Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
• Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Execution engine: executes the tasks in proper dependency order; interacts with Hadoop.
• HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other applications.
• Client components: CLI, web interface, JDBC/ODBC interface.
• Extensibility interfaces include SerDe, User Defined Functions, and User Defined Aggregate Functions.
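The map/reduce DAG the query compiler produces can be inspected with EXPLAIN (the query shown is illustrative; the table name is hypothetical):

```sql
-- EXPLAIN prints the compiled plan: the parsed query plus the
-- DAG of map/reduce stages and their dependencies.
EXPLAIN
SELECT ds, COUNT(*)
FROM page_views
GROUP BY ds;
```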


Page 11:

Sample Query Plan


Page 12:

Hive Usage in Facebook

• Hive and Hadoop are extensively used at Facebook for different kinds of operations.
• 700 TB ≈ 2.1 petabytes after 3× replication!
• Think of other application models that can leverage Hadoop MR.


Page 13:

General approach to learning Hive

• Similar to RDBMS• DDL: Data definition language– Schema definition, create table, meta data defn.

• DML: Data Manipulation Language– Load table

• HiveQL: Query language– Select, insert etc.

• Underlying structure is either HDFS/hadoop or HBase
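One statement from each layer, as a sketch (the table and file names are hypothetical):

```sql
-- DDL: define the schema
CREATE TABLE page_views (userid BIGINT, page STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- DML: load data into the table
LOAD DATA INPATH '/tmp/page_views.tsv' INTO TABLE page_views;

-- HiveQL: query it
SELECT page, COUNT(*) FROM page_views GROUP BY page;
```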


Page 14:

Demo on Amazon AWS

• Goal:
  – Understand Amazon AWS
  – Identify the essential services among the many services provided
  – Set up Hive and run queries
• Demo URL: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html


Page 15:

Discussion of the Demo Problem

• Data: Apache web logs containing the IP address of each client (want to know who accessed your website?)
• Goal: the web log is unstructured big data; collect and clean the data, load it using Hive, then query it with Hive.
  – Deserialize and serialize data into and out of Hive (hence the name SerDe)
  – Count the lines
  – See the first line of data
  – Count the number of accesses from a particular IP
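Once the logs are loaded, the three query tasks might look like this (a sketch; the table name, column name, and IP are hypothetical):

```sql
SELECT COUNT(*) FROM apache_logs;          -- count the lines
SELECT * FROM apache_logs LIMIT 1;         -- see the first line of data
SELECT COUNT(*) FROM apache_logs
WHERE host = '192.168.1.10';               -- accesses from a particular IP
```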
