hive - indico · hive architecture jdbc odbc thrift api hadoop cluster node hiveserver2 ft e hive...

Post on 20-May-2020

Hive

What is Hive?

• Data warehousing layer on top of Hadoop

– table abstractions

• SQL-like language (HiveQL) for “batch” data processing

• SQL is translated into one or a series of MapReduce executions

• Good for ad-hoc reporting queries on HDFS data

– however, the generated MR executions can be suboptimal

What Hive is not…

• Not a relational database

– no transactions

– no row updates

– no indexes

– no constraints

• No interactive querying

• Not a DBMS

Hive overview

[Diagram; recoverable labels:]

• Metadata (table definitions, data locations)

• Hive Engine (compiler, optimizer, executor)

• HDFS: Hadoop Distributed File System

• YARN: cluster resource manager

• MapReduce

• Arrow labels: lookups, direct data reading, data processing

Hive Architecture

[Architecture diagram; recoverable labels:]

• Clients: JDBC, ODBC, Thrift API

• Hadoop cluster node: HiveServer2 with a Thrift service and Hive drivers (compilation, optimization, execution)

• Hadoop cluster node: Metastore Server with a Thrift service and a DB driver to the Metadata RDBMS (table definitions, data locations)

• Cluster nodes: HDFS and MapReduce tasks on each node

• SQL master node, YARN

• Arrow labels: request, lookups, updates

Metastore

• Contains table definitions and data locations on HDFS

• Stored in a separate RDBMS

– very often on one of the cluster machines

– MySQL, PostgreSQL, Oracle, Derby…

• Used by many other Hadoop components

– via HCatalog service

Hive Table

• Table

– Definitions stored in a Hive metastore (RDBMS)

– Data are stored in files on HDFS

• Data files can be in various formats

– but the format has to be the same within a single table

• Table partitioning and bucketing are supported

• EXTERNAL tables (DMLs are not possible)

• Table statistics can be gathered

– important for getting optimal execution plans
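
As a minimal sketch of the table concepts above, the following HiveQL defines an external table over files that already exist on HDFS and then gathers statistics for it. The table name, columns, field delimiter, and HDFS path are hypothetical and chosen only for illustration.

```sql
-- Hypothetical external table: only the definition is written to the metastore;
-- the data stays in the files under LOCATION on HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS weather_ext (
  station_id  STRING,
  obs_date    STRING,
  temperature DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hypothetical/data/weather';

-- Gather table-level statistics, then column-level statistics,
-- so the optimizer has something to base its execution plan on.
ANALYZE TABLE weather_ext COMPUTE STATISTICS;
ANALYZE TABLE weather_ext COMPUTE STATISTICS FOR COLUMNS station_id, temperature;
```

Dropping an external table removes only its definition from the metastore; the files under the given location are left in place.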

Interacting with Hive

• Remotely

– JDBC and ODBC drivers

– Thrift API

– beeline (via JDBC, HiveServer2 support)

• Locally

– hive-shell (deprecated)

– beeline

Operations

• Based on SQL-92 specification

• DDL

– CREATE TABLE, ALTER TABLE, DROP TABLE…

• DML

– INSERT, INSERT OVERWRITE…

• SQL

– SELECT…

• DISTINCT…JOIN…WHERE…GROUP BY…HAVING…ORDER BY…LIMIT

– REGEXP supported

– Subqueries only in the FROM clause
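
The clauses listed above can be combined roughly as follows. This is only a sketch: the table weather_ext and its columns (station_id, obs_date, temperature) are hypothetical, and obs_date is assumed to hold strings such as '2014-06-01'.

```sql
-- Hypothetical table: weather_ext(station_id STRING, obs_date STRING, temperature DOUBLE)
SELECT t.station_id,
       COUNT(*)           AS n_obs,
       AVG(t.temperature) AS avg_temp
FROM (
  -- Subquery in the FROM clause; Hive requires an alias for it (here: t).
  SELECT station_id, temperature
  FROM weather_ext
  WHERE obs_date REGEXP '^2014-'   -- regular-expression match on a string column
) t
GROUP BY t.station_id
HAVING COUNT(*) > 10               -- keep only stations with more than 10 observations
ORDER BY avg_temp DESC
LIMIT 5;
```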

Data Types

• TINYINT – 1 byte

• BOOLEAN

• SMALLINT – 2 bytes

• INT – 4 bytes

• BIGINT – 8 bytes

• DOUBLE

• STRING

• STRUCT – named fields

• MAP – key-value pairs collection

• ARRAY – ordered collection of elements of the same type
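
To illustrate how the primitive and complex types fit together, here is a small sketch of a table definition and a query that reads the complex columns. The table and all column names are hypothetical.

```sql
-- Hypothetical table mixing primitive and complex types.
CREATE TABLE IF NOT EXISTS sensor_readings (
  sensor_id INT,
  is_active BOOLEAN,
  label     STRING,
  coords    STRUCT<lat:DOUBLE, lon:DOUBLE>,  -- named fields
  tags      MAP<STRING, STRING>,             -- key-value pairs
  readings  ARRAY<DOUBLE>                    -- ordered, all elements of one type
);

-- Accessing complex values: dot notation for STRUCT fields,
-- [key] for MAP entries, [index] for ARRAY elements (indexes start at 0).
SELECT sensor_id,
       coords.lat,
       tags['unit'],
       readings[0]
FROM sensor_readings;
```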

Other features

• Views

• Built-in functions

– floor, rand, cast, case, if, concat, substr etc.

– ‘show functions’

• User defined functions

– have to be delivered in a JAR
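
A short sketch of these features in use. The table weather_ext and its columns are hypothetical, and the JAR path and Java class name in the last two statements are placeholders, not a real library.

```sql
-- List the functions known to the current session and describe one of them.
SHOW FUNCTIONS;
DESCRIBE FUNCTION concat;

-- A few built-in functions applied to a hypothetical table
-- weather_ext(station_id STRING, obs_date STRING, temperature DOUBLE).
SELECT concat(station_id, '-', substr(obs_date, 1, 4)) AS station_year,
       floor(temperature)                              AS temp_floor,
       if(temperature < 0, 'frost', 'no frost')        AS frost_flag,
       cast(temperature AS INT)                        AS temp_int
FROM weather_ext
LIMIT 10;

-- Registering a user-defined function delivered in a JAR.
-- Both the JAR path and the class name are placeholders.
ADD JAR /tmp/hypothetical-udfs.jar;
CREATE TEMPORARY FUNCTION to_fahrenheit AS 'org.example.hive.ToFahrenheit';
```

A temporary function is visible only in the session that created it.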

Using Hive CLIs

• Starting hive shell (deprecated)

> hive

• Use beeline instead (supports new HiveServer2)

> beeline

– Connection in remote mode

!connect jdbc:hive2://localhost:10000/default

– Connection in embedded mode

!connect jdbc:hive2://

Useful Hive commands

• Get all databases

show databases

• Set a default database

use <db_name>

• Show tables in a database

show tables

• Show table definition

desc <table_name>

• Explaining plan

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] <query>
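
A short session using these commands might look like the sketch below. It assumes a database named default and a table named weather (the name used in the hands-on part of this talk) already exist.

```sql
SHOW DATABASES;
USE default;
SHOW TABLES;

DESCRIBE weather;            -- desc is a shorthand for DESCRIBE
DESCRIBE FORMATTED weather;  -- also shows location, table type and table parameters

-- Print the execution plan without running the query.
EXPLAIN SELECT COUNT(*) FROM weather;
```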

Hands on Hive (1)

• All scripts are available with:

mkdir hive; cd hive

wget https://cern.ch/zbaranow/hive.zip

unzip hive.zip

hdfs dfs -put ~/tutorials/data data

• To execute a script in beeline

!run <script_name>

• Creation of an external table from existing data (name=geneva)

– external.sql

• Creation of an external table without “ characters (name=geneva_clean)

– external_clean.sql

• Creation of a local table ‘as select’ from external (name=weather)

– standard.sql

• Querying the data

– queries.sql

Hands-on Hive (2)

• Explain plan

– explain.sql

• Table statistics

– stats.sql

• Creation of a partitioned table (name=weather_part)

– partitioned.sql

• Creation of partitioned and bucketed table (name=weather_part_buck)

– bucketing.sql

• Creation of a table stored in a Parquet format (name=weather_parquet)

– parquet.sql

• Creation of a compressed table (name=weather_compr)

– compressed.sql
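
The scripts themselves are not reproduced in this transcript. As a rough idea of what a partitioned and bucketed Parquet table can look like, here is a sketch; it is not the content of the tutorial scripts. The table names weather_part_buck_demo and weather_src_demo, the columns, and the bucket count are hypothetical, and obs_date is assumed to hold strings such as '2014-06-01'.

```sql
-- Hypothetical partitioned and bucketed table stored as Parquet.
CREATE TABLE IF NOT EXISTS weather_part_buck_demo (
  station_id  STRING,
  obs_date    STRING,
  temperature DOUBLE
)
PARTITIONED BY (obs_year INT)                -- one HDFS directory per year
CLUSTERED BY (station_id) INTO 4 BUCKETS     -- rows hashed on station_id into 4 files
STORED AS PARQUET;

-- Allow Hive to derive partition values from the data being inserted.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE weather_part_buck_demo PARTITION (obs_year)
SELECT station_id,
       obs_date,
       temperature,
       CAST(substr(obs_date, 1, 4) AS INT) AS obs_year
FROM weather_src_demo;
```

Queries that filter on obs_year can then skip the directories of the other years instead of scanning the whole table.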

Hands on Hive (JDBC)

• Compile the code

javac HiveJdbcClient.java

• Run

– set classpath

source ./setHiveEnv.sh

– Execute

java -cp $CLASSPATH HiveJdbcClient

This talk does not cover…

• Views

– Object representing a SQL statement

• SerDe

– Serializer and Deserializer of data

– Predefined SerDes exist for common data formats

– Custom SerDes can be written by a user

• Writing UDFs

• Querying HBase

Summary

• Provides a table abstraction layer on HDFS data

• SQL on Hadoop translated to MapReduce jobs

• The Hive Query Language has some limitations

• For batch processing, not interactive

• Can append data

– But no row updates or deletions
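
The difference between appending and replacing data can be sketched as follows. The table weather is the local table from the hands-on part; weather_staging is a hypothetical table assumed to have identical columns.

```sql
-- Append: new files are added next to the existing data of the table.
INSERT INTO TABLE weather
SELECT * FROM weather_staging;

-- Replace: the existing contents of the table are overwritten.
INSERT OVERWRITE TABLE weather
SELECT * FROM weather_staging;
```

Under the limitations listed above, correcting individual rows means rewriting the whole table (or a whole partition) with INSERT OVERWRITE.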