
Apache Hive

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop™, it provides:

- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
- Query execution via MapReduce

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
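For example, a custom UDF packaged in a jar can be registered and used from QL as sketched below. The jar path, class name, function name, and table are hypothetical placeholders, not something shipped with Hive:

hive> ADD JAR /path/to/my_udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
hive> SELECT my_lower(title) FROM my_table;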

Hive does not mandate that data be read or written in a "Hive format"; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats. Please see File Format and SerDe in the Developer Guide for details.

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.

General Information about Hive

- Getting Started
- Presentations and Papers about Hive
- A List of Sites and Applications Powered by Hive
- FAQ
- hive-users Mailing List
- Hive IRC Channel: #hive on irc.freenode.net
- About This Wiki

User Documentation

- Hive Tutorial
- HiveQL Language Manual (Queries, DML, DDL, and CLI)
- Hive Operators and Functions
- Hive Web Interface
- Hive Client (JDBC, ODBC, Thrift, etc.)
- HiveServer2 Client
- Hive Change Log
- Avro SerDe

Administrator Documentation

- Installing Hive
- Configuring Hive
- Setting Up Metastore
- Setting Up Hive Web Interface
- Setting Up Hive Server (JDBC, ODBC, Thrift, etc.)
- Hive on Amazon Web Services
- Hive on Amazon Elastic MapReduce

Resources for Contributors

- Hive Developer FAQ
- How to Contribute
- Hive Contributors Meetings
- Hive Developer Guide
- Plugin Developer Kit
- Unit Test Parallel Execution
- Hive Performance
- Hive Architecture Overview
- Hive Design Docs
- Roadmap/Call to Add More Features
- Full-Text Search over All Hive Resources
- Becoming a Committer
- How to Commit
- How to Release
- Build Status on Jenkins (Formerly Hudson)
- Project Bylaws

For more information, please see the official Hive website.

Apache Hive, Apache Hadoop, Apache HBase, Apache HDFS, Apache, the Apache feather logo, and the Apache Hive project logo are trademarks of The Apache Software Foundation.


Table of Contents

- Installation and Configuration
  - Requirements
  - Installing Hive from a Stable Release
  - Building Hive from Source
    - Compile Hive on Hadoop 23
- Running Hive
  - Configuration management overview
  - Runtime configuration
  - Hive, Map-Reduce and Local-Mode
  - Error Logs
- DDL Operations
  - Metadata Store
- DML Operations
- SQL Operations
  - Example Queries
    - SELECTS and FILTERS
    - GROUP BY
    - JOIN
    - MULTITABLE INSERT
    - STREAMING
- Simple Example Use Cases
  - MovieLens User Ratings
  - Apache Weblog Data

DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now, although it may very well work on other similar platforms. It does not work on Cygwin. Most of our testing has been on Hadoop 0.20, so we advise running it against this version even though it may compile/work against other versions.

Installation and Configuration

Requirements

- Java 1.6
- Hadoop 0.20.x


Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:

$ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:

$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source

The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Compile Hive on Hadoop 23

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive

Hive uses Hadoop. This means:

- you must have Hadoop in your path, OR
- export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive.


Commands to perform this setup

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

I also find it useful, but not necessary, to set HIVE_HOME:

$ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

$ $HIVE_HOME/bin/hive

Configuration management overview

Hive's default configuration is stored in <install-dir>/conf/hive-default.xml.

Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml

The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.

Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties.

Hive configuration is an overlay on top of Hadoop - the Hadoop configuration variables are inherited by default. Hive configuration can be manipulated by:

- Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
- Using the set command from the CLI (see below)
- Invoking Hive using the syntax:

  $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2

  This sets the variables x1 and x2 to y1 and y2 respectively.

- Setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above

Runtime configuration

Hive queries are executed using map-reduce and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables. The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:

hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;

The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode

Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:

mapred.job.tracker


While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.

Starting with version 0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:

hive> SET mapred.job.tracker=local;

In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space).

Starting with version 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically. The relevant options are:

hive> SET hive.exec.mode.local.auto=false;

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:

- The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
- The total number of map-tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
- The total number of reduce tasks required is 1 or 0

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
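As a sketch of how these settings are used together from the CLI (the values shown are just the defaults mentioned above), assuming Hive 0.7 or later:

hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;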

Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.

Error Logs

Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI. The default logging level is WARN and the logs are stored in the folder:

/tmp/<user.name>/hive.log

If the user wishes, the logs can be emitted to the console by adding the arguments shown below:

bin/hive -hiveconf hive.root.logger=INFO,console

Alternatively, the user can change the logging level only by using:

bin/hive -hiveconf hive.root.logger=INFO,DRFA


Note that setting hive.root.logger via the 'set' command does not change logging properties, since they are determined at initialization time.

Hive also stores query logs on a per-Hive-session basis in /tmp/<user.name>/; this location can be configured in hive-site.xml with the hive.querylog.location property.

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting v-0.6 - Hive uses the hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on a NFS file server for example). Execution logs are invaluable for debugging run-time errors.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to [email protected].

DDL Operations

Creating Hive tables and browsing through them:

hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an integer and the other a string

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).

hive> SHOW TABLES;

lists all the tables

hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

shows the list of columns

As for altering tables, table names can be changed and additional columns can be added:


hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

hive> DROP TABLE pokes;

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db.

Right now, in the default configuration, this metadata can only be seen by one user at a time.

The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to the JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes, try HiveDerbyServerMode.

DML Operations

Loading data from flat files into Hive:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
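For example, this sketch appends the same sample file a second time instead of replacing the table contents (the file path reuses the one from the command above):

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;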

NOTES:

NO verification of data against the schema is performed by the load command. If the file is in hdfs, it is moved into the Hive-controlled file system namespace.

The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');


The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table, local directory or HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.

Partitioned tables must always have a partition selected in the WHERE clause of the statement.
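For instance, the partition column ds can be named explicitly in the projection instead of being picked up through * (a small sketch against the invites table above):

hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';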

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from pokes table into a local directory

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;


hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

GROUP BY

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT

FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download and extract the data files:

wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz

And load it into the table that was just created:

LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])

Use the mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).


Apache Weblog Data

The format of the Apache weblog is customizable, though most webmasters use the default. For the default Apache weblog, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
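Once the table is defined, it can be queried like any other Hive table. A small illustrative sketch, counting requests per HTTP status code (use COUNT(1) on releases without HIVE-287, as noted earlier):

SELECT status, COUNT(*) FROM apachelog GROUP BY status;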

Hive User FAQ

- I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
- How to change the warehouse.dir location for older tables?
- When running a JOIN query, I see out-of-memory errors.
- I am using MySQL as metastore and I see errors: "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"
- Does Hive support Unicode?
- HiveQL
  - Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?
  - What are the maximum allowed lengths for HiveQL identifiers?
- Importing Data into Hive
  - How do I import XML data into Hive?
  - How do I import CSV data into Hive?
  - How do I import JSON data into Hive?
  - How do I import Thrift data into Hive?
  - How do I import Avro data into Hive?
  - How do I import delimited text data into Hive?
  - How do I import fixed-width data into Hive?
  - How do I import ASCII logfiles (HTTP, etc) into Hive?
- Exporting Data from Hive
- Hive Data Model
  - What is the difference between a native table and an external table?
  - What are dynamic partitions?
  - Can a Hive table contain data in more than one format?
  - Is it possible to set the data format on a per-partition basis?
- JDBC Driver
  - Does Hive have a JDBC Driver?
- ODBC Driver
  - Does Hive have an ODBC driver?

I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

Run the following commands:

cd ~/.ant/cache/hadoop/core/sources

wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

How to change the warehouse.dir location for older tables?

To change the base location of the Hive tables, edit the hive.metastore.warehouse.dir parameter. This will not affect older tables; their metadata needs to be changed in the metastore database (MySQL or Derby). The location of Hive tables is stored in the SDS table, in the LOCATION column.

When running a JOIN query, I see out-of-memory errors.

This is usually caused by the order of the JOIN tables. Instead of "FROM tableA a JOIN tableB b ON ...", try "FROM tableB b JOIN tableA a ON ...". Note that if you are using LEFT OUTER JOIN, you might want to change to RIGHT OUTER JOIN. This trick usually solves the problem; the rule of thumb is: always put the table with a lot of rows having the same value in the join key on the rightmost side of the JOIN.

I am using MySQL as metastore and I see errors: "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"

This is usually caused by MySQL servers closing connections after the connection has been idle for some time. Running the following command on the MySQL server should solve the problem: "set global wait_timeout=120;"

When using MySQL as a metastore I see the error "com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes". This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another character set, such as 'latin1', which is known to work.

Does Hive support Unicode?

HiveQL

Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?

No. Hive is case insensitive.

Executing:

SELECT * FROM MyTable WHERE myColumn = 3

is strictly equivalent to

select * from mytable where mycolumn = 3

What are the maximum allowed lengths for HiveQL identifiers?

Importing Data into Hive


How do I import XML data into Hive?

How do I import CSV data into Hive?

How do I import JSON data into Hive?

How do I import Thrift data into Hive?

How do I import Avro data into Hive?

How do I import delimited text data into Hive?

How do I import fixed-width data into Hive?

How do I import ASCII logfiles (HTTP, etc) into Hive?

Exporting Data from Hive

Hive Data Model

What is the difference between a native table and an external table?

What are dynamic partitions?

Can a Hive table contain data in more than one format?

Is it possible to set the data format on a per-partition basis?

JDBC Driver


Does Hive have a JDBC Driver?

Yes. Look for the hive-jdbc jar. The driver class is 'org.apache.hadoop.hive.jdbc.HiveDriver'.

It supports two modes: a local mode and a remote one.

In the remote mode it connects to the hive server through its Thrift API. The JDBC url to use should be of the form: 'jdbc:hive://hostname:port/databasename'

In the local mode Hive is embedded. The JDBC url to use should be 'jdbc:hive://'.

ODBC Driver

Does Hive have an ODBC driver?

Hive Tutorial

Concepts

- What is Hive
- What Hive is NOT
- Data Units
- Type System
  - Primitive Types
  - Complex Types
- Built in operators and functions
  - Built in operators
  - Built in functions
- Language capabilities

Usage and Examples

- Creating Tables
- Browsing Tables and Partitions
- Loading Data
- Simple Query
- Partition Based Query
- Joins
- Aggregations
- Multi Table/File Inserts
- Dynamic-partition Insert
- Inserting into local files
- Sampling
- Union all
- Array Operations
- Map(Associative Arrays) Operations
- Custom map/reduce scripts
- Co-Groups
- Altering Tables
- Dropping Tables and Partitions

Concepts

What is Hive

Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

What Hive is NOT

Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when the data sets involved are very small (say a few hundred megabytes). As a result it cannot be compared with systems such as Oracle where analyses are conducted on a significantly smaller amount of data but the analyses proceed much more iteratively with the response times between iterations being less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.

Data Units

In order of granularity, Hive data is organized into:

- Databases: Namespaces that separate tables and other data units from naming conflicts.
- Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could comprise the following columns (schema):
  - timestamp - which is of INT type that corresponds to a unix timestamp of when the page was viewed.
  - userid - which is of BIGINT type that identifies the user who viewed the page.
  - page_url - which is of STRING type that captures the location of the page.
  - referer_url - which is of STRING type that captures the location of the page from where the user arrived at the current page.
  - IP - which is of STRING type that captures the IP address from where the page request was made.
- Partitions: Each Table can have one or more partition Keys which determine how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria. For example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience, but it is the user's job to guarantee the relationship between partition name and data content. Partition columns are virtual columns; they are not part of the data itself but are derived on load.
- Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_view table. These can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.

Type System

Primitive Types

Types are associated with the columns in the tables. The following primitive types are supported:

- Integers
  - TINYINT - 1 byte integer
  - SMALLINT - 2 byte integer
  - INT - 4 byte integer
  - BIGINT - 8 byte integer
- Boolean type
  - BOOLEAN - TRUE/FALSE
- Floating point numbers
  - FLOAT - single precision
  - DOUBLE - double precision
- String type
  - STRING - sequence of characters in a specified character set

The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):

Type
  Primitive Type
    Number
      DOUBLE
        BIGINT
          INT
            TINYINT
        FLOAT
          INT
            TINYINT
    STRING
    BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows the following special case of conversion:

<STRING> to <DOUBLE>

Explicit type conversion can be done using the cast operator as shown in the Built in functions section below.
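As a small sketch of both behaviors (the literal values are arbitrary, and the pokes table from the Getting Started examples is reused only for convenience): the string is implicitly treated as a DOUBLE in the arithmetic expression, while cast makes the conversion explicit.

SELECT '100' + 50, cast('100' AS BIGINT) FROM pokes LIMIT 1;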


Complex Types

Complex Types can be built up from primitive types and other composite types using:

Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT} the a field is accessed by the expression c.a

Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example in a map M comprising of a mapping from 'group' -> gid the gid value can be accessed using M['group']

Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:

- gender - which is a STRING.
- active - which is a BOOLEAN.
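A minimal sketch tying these together (the table and column names here are made up for illustration, not taken from the tutorial): one column of each complex type, queried with the access notations described above.

CREATE TABLE complex_demo (
  s STRUCT<a:INT, b:INT>,
  m MAP<STRING, INT>,
  arr ARRAY<STRING>);

SELECT s.a, m['group'], arr[1] FROM complex_demo;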

Built in operators and functions

Built in operators

Relational Operators - The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds or not.

- A = B (all primitive types): TRUE if expression A is equivalent to expression B, otherwise FALSE.
- A != B (all primitive types): TRUE if expression A is not equivalent to expression B, otherwise FALSE.
- A < B (all primitive types): TRUE if expression A is less than expression B, otherwise FALSE.
- A <= B (all primitive types): TRUE if expression A is less than or equal to expression B, otherwise FALSE.
- A > B (all primitive types): TRUE if expression A is greater than expression B, otherwise FALSE.
- A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
- A IS NULL (all types): TRUE if expression A evaluates to NULL, otherwise FALSE.
- A IS NOT NULL (all types): FALSE if expression A evaluates to NULL, otherwise TRUE.
- A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
- A RLIKE B (strings): TRUE if string A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to FALSE whereas 'foobar' rlike '^f.*r$' evaluates to TRUE.
- A REGEXP B (strings): Same as RLIKE.
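For instance, the string operators above can be used directly in a WHERE clause; a sketch that reuses the invites table from the Getting Started examples:

SELECT a.bar FROM invites a WHERE a.bar LIKE 'foo%';
SELECT a.bar FROM invites a WHERE a.bar RLIKE '^f.*r$';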

Arithmetic Operators - The following operators support various common arithmetic operations on the operands. All of them return number types.

- A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands, e.g. since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
- A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
- A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes an overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
- A / B (all number types): Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
- A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
- A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
- A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
- A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
- ~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

Logical Operators - The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

- A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE.
- A && B (boolean): Same as A AND B.
- A OR B (boolean): TRUE if either A or B or both are TRUE, otherwise FALSE.
- A || B (boolean): Same as A OR B.
- NOT A (boolean): TRUE if A is FALSE, otherwise FALSE.
- !A (boolean): Same as NOT A.

Operators on Complex Types - The following operators provide mechanisms to access elements in Complex Types.

- A[n] (A is an Array and n is an int): Returns the nth element in the array A. The first element has index 0, e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'.
- M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map, e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'.
- S.x (S is a struct): Returns the x field of S, e.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.

Built in functions

The following built in functions are supported in Hive (function list in source code: FunctionRegistry.java):

- BIGINT round(double a): returns the rounded BIGINT value of the double.
- BIGINT floor(double a): returns the maximum BIGINT value that is equal to or less than the double.
- BIGINT ceil(double a): returns the minimum BIGINT value that is equal to or greater than the double.
- double rand(), rand(int seed): returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic.
- string concat(string A, string B, ...): returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
- string substr(string A, int start): returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar'.
- string substr(string A, int start, int length): returns the substring of A starting from start position with the given length, e.g. substr('foobar', 4, 2) results in 'ba'.
- string upper(string A): returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR'.
- string ucase(string A): Same as upper.
- string lower(string A): returns the string resulting from converting all characters of A to lower case, e.g. lower('fOoBaR') results in 'foobar'.
- string lcase(string A): Same as lower.
- string trim(string A): returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar'.
- string ltrim(string A): returns the string resulting from trimming spaces from the beginning (left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '.
- string rtrim(string A): returns the string resulting from trimming spaces from the end (right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'.
- string regexp_replace(string A, string B, string C): returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
- int size(Map<K.V>): returns the number of elements in the map type.
- int size(Array<T>): returns the number of elements in the array type.
- value of <type> cast(<expr> as <type>): converts the results of the expression expr to <type>, e.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed.
- string from_unixtime(int unixtime): converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
- string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
- int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
- int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
- int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
- string get_json_object(string json_string, string path): Extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It will return null if the input json string is invalid.
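Several of these functions can be combined in a single query; the following sketch (again against the invites table from the Getting Started section) is purely illustrative:

SELECT concat(upper(substr(a.bar, 1, 3)), '...'), round(rand() * 100), to_date('2008-08-15 00:00:00') FROM invites a;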

The following built in aggregate functions are supported in Hive:

- BIGINT count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) - Returns the total number of retrieved rows, including rows containing NULL values; count(expr) - Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
- DOUBLE sum(col), sum(DISTINCT col): Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
- DOUBLE avg(col), avg(DISTINCT col): Returns the average of the elements in the group or the average of the distinct values of the column in the group.
- DOUBLE min(col): Returns the minimum value of the column in the group.
- DOUBLE max(col): Returns the maximum value of the column in the group.

Language capabilities

Hive's query language provides the basic SQL-like operations. These operations work on tables or partitions. These operations are:

- Ability to filter rows from a table using a WHERE clause.
- Ability to select certain columns from the table using a SELECT clause.
- Ability to do equi-joins between two tables.
- Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
- Ability to store the results of a query into another table.
- Ability to download the contents of a table to a local (e.g., NFS) directory.
- Ability to store the results of a query in a Hadoop DFS directory.
- Ability to manage tables and partitions (create, drop and alter).
- Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples

The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results.

Creating Tables

An example statement that would create the page_view table mentioned above is:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field delimiter and newline as the row delimiter.

The field delimiter can be parametrized if the data is not in the above format as illustrated in the following example:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is determined not by Hive but by Hadoop.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.
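A sketch of what such bucket-based sampling looks like; the TABLESAMPLE clause is covered in the Sampling section listed in the table of contents, so treat the exact values here as illustrative:

SELECT * FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) WHERE dt='2008-06-08';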

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In this example the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the Hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on HDFS. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults.

Table names and column names are case insensitive.

Browsing Tables and Partitions

SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

SHOW TABLES 'page.*';

To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).

SHOW PARTITIONS page_view;

To list partitions of a table. If the table is not a partitioned table then an error is thrown.

DESCRIBE page_view;

To list columns and column types of table.

DESCRIBE EXTENDED page_view;

To list columns and all other properties of a table. This prints a lot of information, and not in a pretty format. It is usually used for debugging.

DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

To list columns and all other properties of a partition. This also prints a lot of information, and is usually used for debugging.

Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert them into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
       WHERE pvs.country = 'US';

In the example above nulls are inserted for the array and map types in the destination tables but potentially these can also come from the external table if the proper row formats are specified.

This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local file system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory - it cannot contain subdirectories. Similarly - the wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS - the following syntax can be used to load the data into a Hive table:

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

It is assumed that the array and map fields in the input.txt files are null fields for these examples.

Simple Query

For all the active users, one can use the query of the following form:

INSERT OVERWRITE TABLE user_active SELECT user.* FROM user WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on Hive CLI:

SELECT user.* FROM user WHERE user.active = 1;


This will be internally rewritten to some temporary file and displayed to the Hive client side.

Partition Based Query

What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

INSERT OVERWRITE TABLE xyz_com_page_views SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND page_views.referrer_url like '%xyz.com';

Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date to do what you think!

Joins

In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:

INSERT OVERWRITE TABLE pv_users SELECT pv.*, u.gender, u.age FROM user u JOIN page_view pv ON (pv.userid = u.id) WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

INSERT OVERWRITE TABLE pv_users SELECT pv.*, u.gender, u.age FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id) WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.

INSERT OVERWRITE TABLE pv_users SELECT u.* FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id) WHERE pv.date = '2008-03-03';

In order to join more than one table, the user can use the following syntax:

INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

Note that Hive only supports equi-joins. Also it is best to put the largest table on the rightmost side of the join to get the best performance.

Aggregations

In order to count the number of distinct users by gender one could write the following query:

INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender;

Multiple aggregations can be performed in the same query; however, no two aggregations can use different DISTINCT columns. For example, the following is possible:

INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender;

However, the following query is not allowed:

INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) FROM pv_users GROUP BY pv_users.gender;

Multi Table/File Inserts

The output of aggregations or simple selects can be further sent to multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities). For example, if along with the gender breakdown one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum' SELECT pv_users.age, count(DISTINCT pv_users.userid) GROUP BY pv_users.age;

The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to a Hadoop DFS file.

Dynamic-partition Insert

In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement as illustrated below.

FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US' INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA' INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

In order to load data into all country partitions for a particular day, you have to add an insert statement for each country present in the input data. This is very inconvenient since you have to have prior knowledge of the list of countries that exist in the input data and create the partitions beforehand. If the list changes for another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. With dynamic partition insert, the input column values are evaluated to determine which partition each row should be inserted into. If that partition has not been created, it is created automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared to the multiple-insert case.

Below is an example of loading data to all country partitions using one insert statement:

FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:

country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, dt has a value associated with it, which means it is a static partition column. If a column is a dynamic partition column, its value will come from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause, because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause such as (dt, country='US'), because that would mean you need to update all partitions with any date whose country sub-partition is 'US'.

An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known from the PARTITION clause. Dynamic partition values are selected by ordering, not by name, and are taken as the last columns of the select clause.
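After such an insert, one quick way to verify which partitions were created and populated is the standard SHOW PARTITIONS statement (a minimal sketch; the exact output depends on the input data):

SHOW PARTITIONS page_view;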

Semantics of the dynamic partition insert statement:

When non-empty partitions already exist for the dynamic partition columns (e.g., country='CA' exists under some dt root partition), a partition will be overwritten if the dynamic partition insert sees the same value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.

Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in a URI (e.g., '%', ':', '/', '#') will be escaped with '%' followed by the two-digit hexadecimal representation of its ASCII value.

If the input column is a type different than STRING, its value will be first converted to STRING to be used to construct the HDFS path.

If the input column value is NULL or an empty string, the row will be put into a special partition whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select it in Hive. JIRA HIVE-1309 is a solution to let the user specify a "bad file" to retain the input partition column values as well.
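A minimal sketch of inspecting or overriding this parameter from the CLI follows; the custom partition name used here is hypothetical:

set hive.exec.default.partition.name;
set hive.exec.default.partition.name=__UNKNOWN_PARTITION__;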

Dynamic partition insert can potentially be a resource hog in that it can generate a large number of partitions in a short time. To guard against this, we define three parameters:

hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than this threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed.

hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If no single mapper/reducer exceeds its limit but the total number of dynamic partitions does, then an exception is raised at the end of the job before the intermediate data are moved to the final destination.

hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by each mapper/reducer updating a Hadoop counter whenever a new file is created. If the total exceeds hive.exec.max.created.files, a fatal error is thrown and the job is killed.

Another situation we want to protect against with dynamic partition insert is the user accidentally specifying all partitions to be dynamic without specifying one static partition, when the original intention was just to overwrite the sub-partitions of one root partition. We define the parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic-partition case. In strict mode, you have to specify at least one static partition. The default mode is strict. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether dynamic partitioning is allowed at all. The default value is false.
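Putting these knobs together, a hedged sketch of enabling dynamic partitioning and raising the limits for a session before running a dynamic-partition insert such as the one shown earlier (the specific limit values are arbitrary assumptions, not recommendations):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=500;
set hive.exec.max.dynamic.partitions=2000;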

In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).

Troubleshooting and best practices:

As stated above, when too many dynamic partitions are created by a particular mapper/reducer, a fatal error is raised and the job is killed. For example, running the following fully dynamic insert:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(dt, country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;

produces an error message that looks something like:

...
2010-05-07 11:10:19,816 Stage-1 map = 0%, reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...

The problem here is that one mapper will take a random set of rows, and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around this is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions per node will be significantly reduced. The above example query could be rewritten to:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION(dt, country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country DISTRIBUTE BY ds, country;

This query will generate a MapReduce job rather than a map-only job. The SELECT clause will be converted into a plan for the mappers, and the output will be distributed to the reducers based on the value of the (ds, country) pairs. The INSERT clause will be converted into the plan for the reducers, which write to the dynamic partitions.

Inserting into local files

In certain situations you would want to write the output to a local file so that you could load it into an Excel spreadsheet. This can be accomplished with the following command:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum' SELECT pv_gender_sum.* FROM pv_gender_sum;

Sampling

The sampling clause allows users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
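For reference, sampling like this relies on the table having been declared with a CLUSTERED BY ... INTO n BUCKETS clause at creation time. A hedged sketch of such a DDL follows; the table name, column list, and bucketing column here are assumptions for illustration only, not the actual definition used earlier:

CREATE TABLE pv_gender_sum_bucketed(userid INT, gender STRING, cnt BIGINT) CLUSTERED BY(userid) INTO 32 BUCKETS;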

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or a divisor of the number of buckets in the table, as specified at table creation time. The buckets chosen are those for which bucket_number modulo y equals x. So in the above example the following TABLESAMPLE clause

TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.

On the other hand the tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.

Union all

The language also supports UNION ALL. For example, suppose there are two different tables that track which user has published a video and which user has published a comment; the following query joins the results of a UNION ALL with the user table to create a single annotated stream for all the video publishing and comment publishing events:

INSERT OVERWRITE TABLE actions_users SELECT u.id, actions.date FROM ( SELECT av.uid AS uid FROM action_video av WHERE av.date = '2008-06-03'

UNION ALL

SELECT ac.uid AS uid FROM action_comment ac WHERE ac.date = '2008-06-03' ) actions JOIN users u ON(u.id = actions.uid);

Array Operations

Array columns in tables can currently only be created programmatically. We will be extending this soon to make it available as part of the CREATE TABLE statement. For the purpose of the current example, assume that pv.friends is of the type array<INT>, i.e. an array of integers. The user can get a specific element in the array by its index as shown in the following command:

SELECT pv.friends[2] FROM page_views pv;

This select expression gets the third item in the pv.friends array (array indexes start at 0).

The user can also get the length of the array using the size function as shown below:

SELECT pv.userid, size(pv.friends) FROM page_view pv;

Map (Associative Arrays) Operations

Maps provide collections similar to associative arrays. Such structures can currently only be created programmatically. We will be extending this soon. For the purpose of the current example, assume that pv.properties is of the type map<String, String>, i.e. an associative array from strings to strings. Accordingly, the following query:

INSERT OVERWRITE TABLE page_views_map SELECT pv.userid, pv.properties['page type'] FROM page_views pv;

can be used to select the 'page type' property from the page_views table.

Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the following query:

SELECT size(pv.properties) FROM page_view pv;

Custom map/reduce scripts

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. For example, in order to run a custom mapper script (map_script) and a custom reducer script (reduce_script), the user can issue the following command, which uses the TRANSFORM clause to embed the mapper and reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop.

FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' AS dt, uid CLUSTER BY dt) map_output

INSERT OVERWRITE TABLE pv_users_reduced REDUCE map_output.dt, map_output.uid USING 'reduce_script' AS date, count;

Sample map script (weekday_mapper.py )

import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print ','.join([userid, str(weekday)])

Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query could also have been written as such:

SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;

Schema-less map/reduce: If there is no AS clause after USING map_script, Hive assumes that the output of the script contains two parts: a key, which is everything before the first tab, and a value, which is everything after the first tab. Note that this is different from specifying "AS key, value", because in that case the value would only contain the portion between the first tab and the second tab if there are multiple tabs.

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. The user still needs to know the reduce output schema, because that has to match what is in the table we are inserting into.

FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' CLUSTER BY key) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.key, map_output.value USING 'reduce_script' AS date, count;

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' AS c1, c2, c3 DISTRIBUTE BY c2 SORT BY c2, c1) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.c1, map_output.c2, map_output.c3 USING 'reduce_script' AS date, count;

Co-Groups

Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns of the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language as follows. Suppose we wanted to cogroup the rows from the action_video and action_comment tables on the uid column and send them to the 'reduce_script' custom reducer; the following syntax can be used:

FROM ( FROM ( FROM action_video av SELECT av.uid AS uid, av.id AS id, av.date AS date

UNION ALL

FROM action_comment ac SELECT ac.uid AS uid, ac.id AS id, ac.date AS date ) union_actions SELECT union_actions.uid, union_actions.id, union_actions.date CLUSTER BY union_actions.uid) map

INSERT OVERWRITE TABLE actions_reduced SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);

Altering Tables

To rename an existing table to a new name. If a table with the new name already exists, an error is returned:

ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for each preexisting column:

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Note that a change in the schema (such as adding columns) preserves the schema for the old partitions of the table, in case it is a partitioned table. All queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.
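As a hedged illustration, assume tab1 above is partitioned by a ds column and already held data before the ALTER (both assumptions made only for this sketch). Querying the new columns over an old partition would then surface the implicit values described above, with c1 coming back as NULL and c2 as its declared default:

SELECT c1, c2 FROM tab1 WHERE ds = '2008-08-01';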

In later versions, we can make it configurable whether to assume certain values, as opposed to throwing an error, when a column is not found in a particular partition.

Dropping Tables and Partitions

Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future feature) that have been built on the table. The associated command is:

DROP TABLE pv_users;

To drop a partition, alter the table to drop the partition:

ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08');

Note that any data for this table or partition will be dropped and may not be recoverable.

This is the Hive Language Manual. It covers:

Hive CLI
Variable Substitution
Data Types
Data Definition Statements
Data Manipulation Statements
Select
Group By
Sort/Distribute/Cluster/Order By
Transform and Map-Reduce Scripts
Operators and User-Defined Functions
XPath-specific Functions
Joins
Lateral View
Union
Sub Queries
Sampling
Explain
Virtual Columns
Locks
Import/Export
Configuration Properties
Authorization
Statistics
Archiving


Hive Operators and Functions
Hive Plug-in Interfaces - User-Defined Functions and SerDes
Reflect UDF
Guide to Hive Operators and Functions
Functions for Statistics and Data Mining

Hive Web Interface

What is the Hive Web Interface

The Hive web interface is an alternative to using the Hive command line interface. Using the web interface is a great way to get started with Hive.

Features

Schema Browsing

An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web-based schema browser. The Hive metadata is presented in a hierarchical manner, allowing you to start at the database level and click through to get information about tables, including the SerDe, column names, and column types.

Detached query execution

A power user issuing multiple Hive queries simultaneously would otherwise have multiple CLI windows open. The Hive web interface manages the session on the web server, not from inside the CLI window. This allows a user to start multiple queries and return to the web interface later to check their status.

No local installation

Any user with a web browser can work with Hive. This has the usual web interface benefits. In particular, a user wishing to interact with Hadoop or Hive directly requires access to many ports, whereas a remote or VPN user only requires access to the Hive web interface, which by default listens on 0.0.0.0 tcp/9999.

Configuration

The Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk, you already have it.

You should not need to edit the defaults for the Hive web interface. HWI uses:

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
  <description>This is the host address the Hive Web Interface will listen on</description>
</property>

<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
  <description>This is the port the Hive Web Interface will listen on</description>
</property>

<property>
  <name>hive.hwi.war.file</name>
  <value>${HIVE_HOME}/lib/hive_hwi.war</value>
  <description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>

You probably want to setup HiveDerbyServerMode to allow multiple sessions at the same time.

Start up

When Hive is initialized with no arguments, the CLI is invoked. Hive has an extension architecture used to start other Hive daemons. Jetty requires Apache Ant to start HWI, so you should define ANT_LIB as an environment variable or add it to the Hive invocation.

export ANT_LIB=/opt/ant/lib
bin/hive --service hwi

Java has no direct way of daemonizing. In a production environment you should create a wrapper script:

nohup bin/hive --service hwi > /dev/null 2> /dev/null &

If you want help on the service invocation or list of parameters you can add

bin/hive --service hwi --help

Authentication

Hadoop currently uses environment properties to determine the user name and group vector. Thus Hive and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first connect to the Hive Web Interface, you are prompted for a user name and groups. This feature was added to support installations using different schedulers.

If you want to tighten up security, you will need to patch the Hive Session Manager source, or you may be able to tweak the JSP to accomplish this.

Accessing

In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi on your web browser.

Tips and tricks

Result file

The result file is local to the web server. A query that produces massive output should set the result file to /dev/null.

Debug Mode

The debug mode is used when the user wants the result file to contain not only the result of the Hive query but also the other messages.

Set Processor

In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by the Set Processor. In HWI, use the form 'x=5', not 'set x=5'.

Walk through

Authorize

[Screenshots: 1_hwi_authorize.png, 2_hwi_authorize.png]

Schema Browser

[Screenshots: 3_schema_table.png, 4_schema_browser.png]

Diagnostics

[Screenshot: 5_diagnostic.png]

Running a query

[Screenshots: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png]

Command Line
JDBC
JDBC Client Sample Code
Running the JDBC Sample Code
JDBC Client Setup for a Secure Cluster
Python
PHP
Thrift Java Client
ODBC
Thrift C++ Client

This page describes the different clients supported by Hive. The command line client currently only supports an embedded server. The JDBC and thrift-java clients support both embedded and standalone servers. Clients in other languages only support standalone servers. For details about the standalone server see Hive Server.

Command Line

Operates in embedded mode only, i.e., it needs to have access to the Hive libraries. For more details see Getting Started.

JDBC

For embedded mode, the URI is just "jdbc:hive://". For a standalone server, the URI is "jdbc:hive://host:port/dbname", where host and port are determined by where the Hive server is run. For example, "jdbc:hive://localhost:10000/default". Currently, the only dbname supported is "default".

JDBC Client Sample Code

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
  private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
      System.exit(1);
    }
    Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.executeQuery("drop table " + tableName);
    ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }

    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);

    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }

    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}

Running the JDBC Sample Code

# Then on the command-line
$ javac HiveJdbcClient.java

# To run the program in standalone mode, we need the following jars in the classpath
# from hive/build/dist/lib
#   hive_exec.jar
#   hive_jdbc.jar
#   hive_metastore.jar
#   hive_service.jar
#   libfb303.jar
#   log4j-1.2.15.jar
#
# from hadoop/build
#   hadoop-*-core.jar
#
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
#   antlr-runtime-3.0.1.jar
#   derby.jar
#   jdo2-api-2.1.jar
#   jpox-core-1.2.2.jar
#   jpox-rdbms-1.2.2.jar
#
# as well as hive/build/dist/conf

$ java -cp $CLASSPATH HiveJdbcClient

# Alternatively, you can run the following bash script, which will seed the data file
# and build your classpath before invoking the client.

#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-*-core.jar)
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf

for i in ${HIVE_HOME}/lib/*.jar ; do
  CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient

JDBC Client Setup for a Secure Cluster

To configure Hive on a secure cluster, add the directory containing hive-site.xml to the CLASSPATH of the JDBC client.

Python

Operates only on a standalone server. Set (and export) PYTHONPATH to build/dist/lib/py.

The python modules imported in the code below are generated by building hive.

Please note that the generated python module names have changed in hive trunk.

#!/usr/bin/env python

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
  transport = TSocket.TSocket('localhost', 10000)
  transport = TTransport.TBufferedTransport(transport)
  protocol = TBinaryProtocol.TBinaryProtocol(transport)

  client = ThriftHive.Client(protocol)
  transport.open()

  client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
  client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
  client.execute("SELECT * FROM r")
  while (1):
    row = client.fetchOne()
    if (row == None):
      break
    print row
  client.execute("SELECT * FROM r")
  print client.fetchAll()

  transport.close()

except Thrift.TException, tx:
  print '%s' % (tx.message)

PHP

Operates only on a standalone server.

<?php
// set THRIFT_ROOT to php directory of the hive distribution
$GLOBALS['THRIFT_ROOT'] = '/lib/php/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';

// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();

// run queries, metadata calls etc
$client->execute('SELECT * from src');
var_dump($client->fetchAll());
$transport->close();

Thrift Java Client

Operates both in embedded mode and on a standalone server.

ODBC

Operates only on a standalone server. See Hive ODBC.

Thrift C++ Client

Operates only on a standalone server. In the works.

Beeline - New Command Line shell
JDBC
JDBC Client Sample Code
Running the JDBC Sample Code
JDBC Client Setup for a Secure Cluster

This page describes the different clients supported by HiveServer2.

Beeline - New Command Line shell

HiveServer2 supports a new command shell, Beeline, that works with HiveServer2. It is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). There is detailed documentation of SQLLine which is applicable to Beeline as well. The Beeline shell works in both embedded and remote mode. In embedded mode, it runs an embedded Hive (similar to the Hive CLI), whereas remote mode is for connecting to a separate HiveServer2 process over Thrift.

Example -

% bin/beeline
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://localhost:10000 scott tiger org.apache.hive.jdbc.HiveDriver
!connect jdbc:hive2://localhost:10000 scott tiger org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
show tables;
+-------------------+
|     tab_name      |
+-------------------+
| primitives        |
| src               |
| src1              |
| src_json          |
| src_sequencefile  |
| src_thrift        |
| srcbucket         |
| srcbucket2        |
| srcpart           |
+-------------------+
9 rows selected (1.079 seconds)

JDBC

HiveServer2 has a new JDBC driver. It supports both embedded and remote access to HiveServer2. The JDBC connection URL format has the prefix jdbc:hive2:// and the driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer. For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db> (the default port for HiveServer2 is 10000). For an embedded server, the URL format is jdbc:hive2:// (no host or port).

JDBC Client Sample Code

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
      System.exit(1);
    }
    // replace "hive" here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.execute("drop table if exists " + tableName);
    stmt.execute("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }

    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    stmt.execute(sql);

    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }

    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}

Running the JDBC Sample Code

# Then on the command-line
$ javac HiveJdbcClient.java

# To run the program in standalone mode, we need the following jars in the classpath
# from hive/build/dist/lib
#   hive-jdbc*.jar
#   hive-service*.jar
#   libfb303-0.9.0.jar
#   libthrift-0.9.0.jar
#   log4j-1.2.16.jar
#   slf4j-api-1.6.1.jar
#   slf4j-log4j12-1.6.1.jar
#   commons-logging-1.0.4.jar
#
# The following additional jars are needed for the kerberos secure mode:
#   hive-exec*.jar
#   commons-configuration-1.6.jar
#   and from hadoop - hadoop-*core.jar
#
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
#   hive-exec*.jar
#   hive-metastore*.jar
#   antlr-runtime-3.0.1.jar
#   derby.jar
#   jdo2-api-2.1.jar
#   jpox-core-1.2.2.jar
#   jpox-rdbms-1.2.2.jar
#
# from hadoop/build
#   hadoop-*-core.jar
# as well as hive/build/dist/conf, any HIVE_AUX_JARS_PATH set,
# and hadoop jars necessary to run MR jobs (eg lzo codec)

$ java -cp $CLASSPATH HiveJdbcClient

# Alternatively, you can run the following bash script, which will seed the data file
# and build your classpath before invoking the client. The script adds all the
# additional jars needed for using HiveServer2 in embedded mode as well.

#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-*-core.jar)
CLASSPATH=.:$HIVE_HOME/conf:$(hadoop classpath)

for i in ${HIVE_HOME}/lib/*.jar ; do
  CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient

JDBC Client Setup for a Secure Cluster

When connecting to HiveServer2 with Kerberos authentication, the URL format is jdbc:hive2://<host>:<port>/<db>;principal=<Server_Principal_of_HiveServer2>. The client needs to have a valid Kerberos ticket in the ticket cache before connecting.

In the case of LDAP or custom pass-through authentication, the client needs to pass a valid user name and password to the JDBC connection API.

This page documents changes that are visible to users.

Hive Trunk (0.8.0-dev)

Hive 0.7.1

Hive 0.7.0
HIVE-1790: Add support for HAVING clause.

Hive 0.6.0

Hive 0.5.0

Earliest version AvroSerde is available

The AvroSerde is available in Hive 0.9.1 and greater.

Overview - Working with Avro from Hive

The AvroSerde allows users to read or write Avro data as Hive tables. The AvroSerde's bullet points:

Infers the schema of the Hive table from the Avro schema.
Reads all Avro files within a table against a specified schema, taking advantage of Avro's backwards compatibility abilities.
Supports arbitrarily nested schemas.
Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don't exist in Hive and are automatically converted by the AvroSerde.
Understands compressed Avro files.
Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T, returning null when appropriate.
Writes any Hive table to Avro files.
Has worked reliably against our most convoluted Avro schemas in our ETL process.

Requirements

The AvroSerde has been built and tested against Hive 0.9.1 and Avro 1.5.

Avro to Hive type conversion

While most Avro types convert directly to equivalent Hive types, there are some which do not exist in Hive and are converted to reasonable equivalents. Also, the AvroSerde special cases unions of null and another type, as described below:

Avro type -> Hive type (notes)

null -> void
boolean -> boolean
int -> int
long -> bigint
float -> float
double -> double
bytes -> Array[smallint] (Hive converts these to signed bytes)
string -> string
record -> struct
map -> map
list -> array
union -> union (Unions of [T, null] transparently convert to nullable T; other unions translate directly to Hive's unions of those types. However, unions were introduced in Hive 0.7 and cannot currently be used in where/group-by statements; they are essentially look-at-only. Because the AvroSerde transparently converts [T, null] to nullable T, this limitation only applies to unions of multiple types or unions other than a single type and null.)
enum -> string (Hive has no concept of enums)
fixed -> Array[smallint] (Hive converts the bytes to signed int)

Creating Avro-backed Hive tables

To create an Avro-backed table, specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe, the input format as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the output format as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location from which the AvroSerde will pull the most current schema for the table. For example:

CREATE TABLE kst PARTITIONED BY (ds string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ( 'avro.schema.url'='http://schema_provider/kst.avsc');

In this example we're pulling the source-of-truth reader schema from a webserver. Other options for providing the schema are described below. Add the Avro files to the database (or create an external table) using standard Hive operations (http://wiki.apache.org/hadoop/Hive/LanguageManual/DML); a sketch follows the describe output below. This table might result in a description like this:

hive> describe kst;
OK
string1        string                                                           from deserializer
string2        string                                                           from deserializer
int1           int                                                              from deserializer
boolean1       boolean                                                          from deserializer
long1          bigint                                                           from deserializer
float1         float                                                            from deserializer
double1        double                                                           from deserializer
inner_record1  struct<int_in_inner_record1:int,string_in_inner_record1:string>  from deserializer
enum1          string                                                           from deserializer
array1         array<string>                                                    from deserializer
map1           map<string,string>                                               from deserializer
union1         uniontype<float,boolean,string>                                  from deserializer
fixed1         array<tinyint>                                                   from deserializer
null1          void                                                             from deserializer
unionnullint   int                                                              from deserializer
bytes1         array<tinyint>                                                   from deserializer
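As referenced above, a hedged sketch of adding existing Avro data files to the table with a standard Hive operation follows; the HDFS path and partition value are hypothetical:

LOAD DATA INPATH '/user/data/avro/kst/2012-01-01' INTO TABLE kst PARTITION (ds='2012-01-01');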

At this point, the Avro-backed table can be worked with in Hive like any other table.

Writing tables to Avro files

The AvroSerde can serialize any Hive table to Avro files. This makes it effectively an any-Hive-type to Avro converter. In order to write a table to an Avro file, you must first create an appropriate Avro schema. Create as select type statements are not currently supported. Types translate as detailed in the table above. For types that do not translate directly, there are a few items to keep in mind:

Types that may be null must be defined as a union of that type and Null within Avro. A null in a field that is not so defined will result in an exception during the save. No changes need be made to the Hive schema to support this, as all fields in Hive can be null.

Avro Bytes type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to Bytes during the saving process.

Avro Fixed type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to Fixed during the saving process.

Avro Enum type should be defined in Hive as strings, since Hive doesn't have a concept of enums. Ensure that only valid enum values are present in the table; trying to save a non-defined enum will result in an exception.

Example

Consider the following Hive table, which coincidentally covers all types of Hive data types, making it a good example:

CREATE TABLE test_serializer(string1 STRING, int1 INT, tinyint1 TINYINT, smallint1 SMALLINT, bigint1 BIGINT, boolean1 BOOLEAN, float1 FLOAT, double1 DOUBLE, list1 ARRAY<STRING>, map1 MAP<STRING,INT>, struct1 STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>, union1 uniontype<FLOAT, BOOLEAN, STRING>, enum1 STRING, nullableint INT, bytes1 ARRAY<TINYINT>, fixed1 ARRAY<TINYINT>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

To save this table as an Avro file, create an equivalent Avro schema (the namespace and actual name of the record are not important):

{ "namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [ { "name":"string1", "type":"string" }, { "name":"int1", "type":"int" }, { "name":"tinyint1", "type":"int" }, { "name":"smallint1", "type":"int" }, { "name":"bigint1", "type":"long" }, { "name":"boolean1", "type":"boolean" }, { "name":"float1", "type":"float" }, { "name":"double1", "type":"double" }, { "name":"list1", "type":{"type":"array", "items":"string"} }, { "name":"map1", "type":{"type":"map", "values":"int"} }, { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [ { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } }, { "name":"union1", "type":["float", "boolean", "string"] }, { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} }, { "name":"nullableint", "type":["int", "null"] }, { "name":"bytes1", "type":"bytes" }, { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} } ] }

If the table were backed by a csv file such as:

why hello there,42,3,100,1412341,true,42.43,85.23423424,alpha:beta:gamma,Earth#42:Control#86:Bob#31,17:true:Abe Linkedin,0:3.141459,BLUE,72,0:1:2:3:4:5,50:51:53
another record,98,4,101,9999999,false,99.89,0.00000009,beta,Earth#101,1134:false:wazzup,1:true,RED,NULL,6:7:8:9:10,54:55:56
third record,45,5,102,999999999,true,89.99,0.00000000000009,alpha:gamma,Earth#237:Bob#723,102:false:BNL,2:Time to go home,GREEN,NULL,11:12:13,57:58:59

one can write it out to Avro with:

CREATE TABLE as_avro ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');

insert overwrite table as_avro select * from test_serializer;

The files that are written by the Hive job are valid Avro files; however, MapReduce doesn't add the standard .avro extension. If you copy these files out, you'll likely want to rename them with .avro.

Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. Therefore, it is incumbent on the query writer to make sure the target column types are correct. If they are not, Avro may accept the type or it may throw an exception; this depends on the particular combination of types.

Specifying the Avro schema for a table

There are three ways to provide the reader schema for an Avro table, all of which involve parameters to the serde. As the schema evolves, one can update these values by updating the parameters in the table.

Use avro.schema.url

Specifies a url to access the schema from. For http schemas, this works for testing and small-scale clusters, but as the schema will be accessed at least once from each task in the job, this can quickly turn the job into a DDOS attack against the URL provider (a web server, for instance). Use caution when using this parameter for anything other than testing.

The schema can also point to a location on HDFS, for instance: hdfs://your-nn:9000/path/to/avsc/file. The AvroSerde will then read the file from HDFS, which should provide resiliency against many reads at once. Note that the serde will read this file from every mapper, so it's a good idea to turn the replication of the schema file up to a high value to provide good locality for the readers. The schema file itself should be relatively small, so this does not add a significant amount of overhead to the process.
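A hedged sketch of the same kst table definition pointing at an HDFS-hosted schema instead of a webserver (the namenode host and path are hypothetical):

CREATE TABLE kst PARTITIONED BY (ds string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='hdfs://your-nn:9000/path/to/avsc/kst.avsc');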

Use avro.schema.literal and embed the schema in the create statement

One can embed the schema directly into the create statement. This works if the schema doesn't have any single quotes (or they are appropriately escaped), as Hive uses this to define the parameter value. For instance:

CREATE TABLE embedded COMMENT "just drop the schema right into the HQL" ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.literal'='{ "namespace": "com.howdy", "name": "some_schema", "type": "record", "fields": [ { "name":"string1","type":"string"}] }');

Note that the value is enclosed in single quotes and just pasted into the create statement.

Use avro.schema.literal and pass the schema into the script

Hive can do simple variable substitution and one can pass the schema embedded in a variable to the script. Note that to do this, the schema must be completely escaped (carriage returns converted to \n, tabs to \t, quotes escaped, etc). An example:

set hiveconf:schema;
DROP TABLE example;
CREATE TABLE example ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.literal'='${hiveconf:schema}');

To execute this script file, assuming $SCHEMA has been defined to be the escaped schema value:

hive -hiveconf schema="${SCHEMA}" -f your_script_file.sql

Note that $SCHEMA is interpolated into the quotes to correctly handle spaces within the schema.

Use none to ignore either avro.schema.literal or avro.schema.url

Hive does not provide an easy way to unset or remove a property. If you wish to switch from using the URL or the literal to the other, set the to-be-ignored value to none and the AvroSerde will treat it as if it were not set.
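For instance, a hedged sketch of switching the kst table away from its URL-provided schema, so that an avro.schema.literal value would be consulted instead:

ALTER TABLE kst SET TBLPROPERTIES ('avro.schema.url'='none');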

If something goes wrong

Hive tends to swallow exceptions from the AvroSerde that occur before job submission. To force Hive to be more verbose, it can be started with hive -hiveconf hive.root.logger=INFO,console, which will spit orders of magnitude more information to the console and will likely include any information the AvroSerde is trying to give you about what went wrong. If the AvroSerde encounters an error during MapReduce, the stack trace will be provided in the failed task log, which can be examined from the JobTracker's web interface. The AvroSerde only emits AvroSerdeException; look for these, and please include them in any bug reports. The most common exception is expected to occur when attempting to serialize a type that is incompatible with what Avro is expecting.

FAQ

Why do I get error-error-error-error-error-error-error and a message to check avro.schema.literal and avro.schema.url when describing a table or running a query against a table?

The AvroSerde returns this message when it has trouble finding or parsing the schema provided by either the avro.schema.literal or avro.schema.url value. It is unable to be more specific because Hive expects all calls to the serde config methods to be successful, meaning we are unable to return an actual exception. By signaling an error via this message, the table is left in a good state and the incorrect value can be corrected with a call to ALTER TABLE ... SET TBLPROPERTIES.

Installing Hive

Installing Hive is simple and only requires having Java 1.6 and Ant installed on your machine.

Hive is available via SVN at http://svn.apache.org/repos/asf/hive/trunk. You can download it by running the following command.

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive

To build hive, execute the following command on the base directory:

$ ant package

It will create the subdirectory build/dist with the following contents:

README.txt: readme file
bin/: directory containing all the shell scripts
lib/: directory containing all required jar files
conf/: directory with configuration files
examples/: directory with sample input and query files

Subdirectory build/dist should contain all the files necessary to run hive. You can run it from there or copy it to a different location, if you prefer.

In order to run Hive, you must have hadoop in your path or have defined the environment variable HADOOP_HOME with the hadoop installation directory.

Moreover, we strongly advise users to create the HDFS directories /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w before tables are created in Hive.

To use the Hive command line interface (CLI), go to the Hive home directory (the one with the contents of build/dist) and execute the following command:

$ bin/hive

Metadata is stored in an embedded Derby database whose disk storage location is determined by the hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db

Using Derby in embedded mode allows at most one user at a time. To configure Derby to run in server mode, look at HiveDerbyServerMode.

Configuring Hive

A number of configuration variables in Hive can be used by the administrator to change the behavior for their installations and user sessions. These variables can be configured in any of the following ways, shown in the order of preference:

Using the set command in the CLI for setting session-level values for a configuration variable, applying to all statements subsequent to the set command. For example:

set hive.exec.scratchdir=/tmp/mydir;

sets the scratch directory (which is used by Hive to store temporary output and plans) to /tmp/mydir for all subsequent statements.

Using the -hiveconf option on the CLI for the entire session. For example:

bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir

In hive-site.xml. This is used for setting values for the entire Hive configuration. For example:

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/mydir</value>
  <description>Scratch space for Hive jobs</description>
</property>

hive-default.xml.template contains the default values for the various configuration variables that come prepackaged in a Hive distribution. In order to override any of these values, create hive-site.xml instead and set the values in that file as shown above. Please note that the template file is not used by Hive at all (as of Hive 0.9.0), so it might be out of date or out of sync with the actual values. The canonical list of configuration options is now managed only in the HiveConf Java class.

hive-default.xml.template is located in the conf directory in your installation root. hive-site.xml should also be created in the same directory.

Broadly the configuration variables are categorized into:

Hive Configuration Variables

Variable Name | Description | Default Value
hive.ddl.output.format | The data format to use for DDL output (e.g. DESCRIBE table). One of "text" (for human readable text) or "json" (for a JSON object). (as of Hive 0.9.0) | text
hive.exec.script.wrapper | Wrapper around any invocations of the script operator, e.g. if this is set to python, the script passed to the script operator will be invoked as python <script command>. If the value is null or not set, the script is invoked as <script command>. | null
hive.exec.plan | | null
hive.exec.scratchdir | This directory is used by Hive to store the plans for the different map/reduce stages of the query as well as to store the intermediate outputs of these stages. | /tmp/<user.name>/hive
hive.exec.submitviachild | Determines whether the map/reduce jobs should be submitted through a separate JVM in non-local mode. | false - by default jobs are submitted through the same JVM as the compiler
hive.exec.script.maxerrsize | Maximum number of serialization errors allowed in a user script invoked through the TRANSFORM, MAP or REDUCE constructs. | 100000
hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not. | false
hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false
hive.jar.path | The location of hive_cli.jar that is used when submitting jobs in a separate JVM. |
hive.aux.jars.path | The location of the plugin jars that contain implementations of user defined functions and SerDes. |
hive.partition.pruning | A strict value for this variable indicates that an error is thrown by the compiler if no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. | nonstrict
hive.map.aggr | Determines whether map-side aggregation is on or not. | true
hive.join.emit.interval | | 1000
hive.map.aggr.hash.percentmemory | | (float)0.5
hive.default.fileformat | Default file format for CREATE TABLE statements. Options are TextFile, SequenceFile and RCFile. | TextFile
hive.merge.mapfiles | Merge small files at the end of a map-only job. | true
hive.merge.mapredfiles | Merge small files at the end of a map-reduce job. | false
hive.merge.size.per.task | Size of merged files at the end of the job. | 256000000
hive.merge.smallfiles.avgsize | When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. | 16000000
hive.querylog.enable.plan.progress | Whether to log the plan's progress every time a job's progress is checked. These logs are written to the location specified by hive.querylog.location. (as of Hive 0.10) | true
hive.querylog.location | Directory where structured Hive query logs are created. One file per session is created in this directory. If this variable is set to an empty string, the structured log will not be created. | /tmp/<user.name>
hive.querylog.plan.progress.interval | The interval to wait between logging the plan's progress, in milliseconds. If there is a whole-number percentage change in the progress of the mappers or the reducers, the progress is logged regardless of this value. The actual interval will be the ceiling of (this value divided by the value of hive.exec.counters.pull.interval) multiplied by the value of hive.exec.counters.pull.interval, i.e. if it does not divide evenly by the value of hive.exec.counters.pull.interval it will be logged less frequently than specified. This only has an effect if hive.querylog.enable.plan.progress is set to true. (as of Hive 0.10) | 60000
hive.stats.autogather | A flag to gather statistics automatically during the INSERT OVERWRITE command. (as of Hive 0.7.0) | true
hive.stats.dbclass | The default database that stores temporary Hive statistics. Valid values are hbase and jdbc, where jdbc should be followed by a specification of the database to use, separated by a colon (e.g. jdbc:mysql). (as of Hive 0.7.0) | jdbc:derby
hive.stats.dbconnectionstring | The default connection string for the database that stores temporary Hive statistics. (as of Hive 0.7.0) | jdbc:derby:;databaseName=TempStatsStore;create=true
hive.stats.jdbcdriver | The JDBC driver for the database that stores temporary Hive statistics. (as of Hive 0.7.0) | org.apache.derby.jdbc.EmbeddedDriver
hive.stats.reliable | Whether queries will fail because stats cannot be collected completely accurately. If this is set to true, reading/writing from/into a partition may fail because the stats could not be computed accurately. (as of Hive 0.10.0) | false
hive.enforce.bucketing | If enabled, enforces that inserts into bucketed tables are also bucketed. | false
hive.variable.substitute | Substitutes variables in Hive statements which were previously set using the set command, system variables or environment variables. See HIVE-1096 for details. (as of Hive 0.7.0) | true
hive.variable.substitute.depth | The maximum number of replacements the substitution engine will do. (as of Hive 0.10.0) | 40
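As an illustration of hive.variable.substitute, a CLI session might look like the following; the table and column names here are invented for the example:

-- the table page_views and column dt are hypothetical
set target_dt=2013-01-01;
SELECT * FROM page_views WHERE dt = '${hiveconf:target_dt}';

The set command stores the value in the session configuration, and the ${hiveconf:...} reference is expanded before the query is compiled.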


Hive Metastore Configuration Variables

Please see the Admin Manual's section on the Metastore for details.

Hive Configuration Variables used to interact with Hadoop

Variable Name | Description | Default Value
hadoop.bin.path | The location of the hadoop script, which is used to submit jobs to Hadoop when submitting through a separate JVM. | $HADOOP_HOME/bin/hadoop
hadoop.config.dir | The location of the configuration directory of the Hadoop installation. | $HADOOP_HOME/conf
fs.default.name | | file:///
map.input.file | | null
mapred.job.tracker | The URL of the JobTracker. If this is set to local, then map/reduce is run in local mode. | local
mapred.reduce.tasks | The number of reducers for each map/reduce stage in the query plan. | 1
mapred.job.name | The name of the map/reduce job. | null

Hive Variables used to pass run time information

Variable Name | Description | Default Value
hive.session.id | The id of the Hive session. |
hive.query.string | The query string passed to the map/reduce job. |
hive.query.planid | The id of the plan for the map/reduce stage. |
hive.jobname.length | The maximum length of the job name. | 50
hive.table.name | The name of the Hive table. This is passed to the user scripts through the script operator. |
hive.partition.name | The name of the Hive partition. This is passed to the user scripts through the script operator. |
hive.alias | The alias being processed. This is also passed to the user scripts through the script operator. |

Temporary Folders

Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the hive client when the query is finished. However, in cases of abnormal hive client termination, some data may be left behind. The configuration details are as follows:

On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir

On the client machine, this is hardcoded to /tmp/<username>

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.

Log Files

The Hive client produces logs and history files on the client machine. Please see Error Logs for configuration details.

The following sections describe the Metastore: Introduction, Embedded Metastore, Local Metastore, Remote Metastore.

Introduction

All the metadata for Hive tables and partitions is stored in the Hive Metastore. Metadata is persisted using the JPOX ORM solution, so any datastore supported by it can be used. Most commercial relational databases and many open source datastores are supported. Any datastore that has a JDBC driver can probably be used.

You can find an E/R diagram for the metastore here.

There are three different ways to set up the metastore server, using different Hive configurations. The relevant configuration parameters are:

Config Param | Description
javax.jdo.option.ConnectionURL | JDBC connection string for the data store which contains metadata
javax.jdo.option.ConnectionDriverName | JDBC driver class name for the data store which contains metadata
hive.metastore.uris | Hive connects to this URI to make metadata requests for a remote metastore
hive.metastore.local | local or remote metastore (removed as of Hive 0.10: if hive.metastore.uris is empty local mode is assumed, remote otherwise)
hive.metastore.warehouse.dir | URI of the default location for native tables

These variables were carried over from old documentation without a guarantee that they all still exist:

Variable Name | Description | Default Value
hive.metastore.metadb.dir | |
hive.metastore.usefilestore | |
hive.metastore.rawstore.impl | |
org.jpox.autoCreateSchema | Creates the necessary schema on startup if one doesn't exist (e.g. tables, columns). Set to false after creating it once. |
org.jpox.fixedDatastore | Whether the datastore schema is fixed. |
hive.metastore.checkForDefaultDb | |
hive.metastore.ds.connection.url.hook | Name of the hook to use for retrieving the JDO connection URL. If empty, the value in javax.jdo.option.ConnectionURL is used as the connection URL. |
hive.metastore.ds.retry.attempts | The number of times to retry a call to the backing datastore if there was a connection error. | 1
hive.metastore.ds.retry.interval | The number of milliseconds between datastore retry attempts. | 1000
hive.metastore.server.min.threads | Minimum number of worker threads in the Thrift server's pool. | 200
hive.metastore.server.max.threads | Maximum number of worker threads in the Thrift server's pool. | 10000

The default configuration sets up an embedded metastore, which is used in unit tests and is described in the next section. More practical options are described in the subsequent sections.


Embedded Metastore

The embedded metastore is mainly used for unit tests, and only one process can connect to the metastore at a time. So it is not really a practical solution, but it works well for unit tests.

Config Param | Config Value | Comment
javax.jdo.option.ConnectionURL | jdbc:derby:;databaseName=../build/test/junit_metastore_db;create=true | Derby database located at hive/trunk/build...
javax.jdo.option.ConnectionDriverName | org.apache.derby.jdbc.EmbeddedDriver | Derby embedded JDBC driver class
hive.metastore.uris | | not needed since this is a local metastore
hive.metastore.local | true | embedded is local
hive.metastore.warehouse.dir | file://${user.dir}/../build/ql/test/data/warehouse | unit test data goes in here on your local filesystem

If you want to run the metastore as a network server so it can be accessed from multiple nodes, try HiveDerbyServerMode.

Local Metastore

In the local metastore setup, each Hive Client opens a connection to the datastore and makes SQL queries against it directly. The following configuration will set up a metastore in a MySQL server. Make sure that the server is accessible from the machines where Hive queries are executed, since this is a local store. Also make sure the JDBC client library is in the classpath of the Hive Client. A sample hive-site.xml combining these values is sketched after the table.

Config Param | Config Value | Comment
javax.jdo.option.ConnectionURL | jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true | metadata is stored in a MySQL server
javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver | MySQL JDBC driver class
javax.jdo.option.ConnectionUserName | <user name> | user name for connecting to the MySQL server
javax.jdo.option.ConnectionPassword | <password> | password for connecting to the MySQL server
hive.metastore.uris | | not needed because this is a local store
hive.metastore.local | true | this is a local store
hive.metastore.warehouse.dir | <base hdfs path> | default location for Hive tables
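Putting these parameters together, a hive-site.xml for a local MySQL-backed metastore might look roughly like the sketch below; the host, database name, credentials and warehouse path are placeholders to replace with your own values:

<!-- Example values only: substitute your own MySQL host, database, credentials and warehouse path -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlhost/hivemetastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>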

Remote Metastore

In the remote metastore setup, all Hive Clients make a connection to a metastore server, which in turn queries the datastore (MySQL in this example) for metadata. The metastore server and client communicate using the Thrift protocol. Starting with Hive 0.5.0, you can start a Thrift metastore server by executing the following command:

hive --service metastore

In versions of Hive earlier than 0.5.0, it's instead necessary to run the thrift server via direct execution of Java:

$JAVA_HOME/bin/java -Xmx1024m -Dlog4j.configuration=file://$HIVE_HOME/conf/hms-log4j.properties -Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64/ -cp $CLASSPATH org.apache.hadoop.hive.metastore.HiveMetaStore

If you execute Java directly, then JAVA_HOME, HIVE_HOME, HADOOP_HOME must be correctly set; CLASSPATH should contain Hadoop, Hive (lib and auxlib), and Java jars.
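The commands above run the metastore server in the foreground. One simple way to keep it running in the background (shown here for the Hive 0.5.0+ invocation; the log file name is arbitrary) is:

# keep the metastore running after the shell exits; metastore.log is an arbitrary log file name
nohup bin/hive --service metastore > metastore.log 2>&1 &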

Server Configuration Parameters

Config Param | Config Value | Comment
javax.jdo.option.ConnectionURL | jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true | metadata is stored in a MySQL server
javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver | MySQL JDBC driver class
javax.jdo.option.ConnectionUserName | <user name> | user name for connecting to the MySQL server
javax.jdo.option.ConnectionPassword | <password> | password for connecting to the MySQL server
hive.metastore.warehouse.dir | <base hdfs path> | default location for Hive tables

Client Configuration Parameters

Config Param | Config Value | Comment
hive.metastore.uris | thrift://<host_name>:<port> | host and port for the Thrift metastore server
hive.metastore.local | false | this is not a local store (the metastore is remote)
hive.metastore.warehouse.dir | <base hdfs path> | default location for Hive tables
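A client-side hive-site.xml combining these parameters might look roughly like this; the metastore host and port are placeholders (9083 is a commonly used metastore port, but use whatever port your server actually listens on):

<!-- Example values only: point hive.metastore.uris at your own metastore server -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>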

If you are using MySQL as the datastore for metadata, put MySQL client libraries in HIVE_HOME/lib before starting Hive Client or HiveMetastore Server.

Hive Web Interface

What is the Hive Web Interface

The Hive Web Interface is an alternative to using the Hive command line interface. Using the web interface is a great way to get started with Hive.

Features

Schema Browsing

An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web-based schema browser. The Hive metadata is presented in a hierarchical manner, allowing you to start at the database level and click through to get information about tables including the SerDe, column names, and column types.


Detached query execution

A power user issuing multiple hive queries simultaneously would have multiple CLI windows open. The hive web interface manages the session on the web server, not from inside the CLI window. This allows a user to start multiple queries and return to the web interface later to check the status.

No local installation

Any user with a web browser can work with Hive. This has the usual web interface benefits. In contrast, a user wishing to interact with Hadoop or Hive directly requires access to many ports. A remote or VPN user would only require access to the Hive Web Interface, which runs by default on 0.0.0.0 tcp/9999.

Configuration

The Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk, you already have it.

You should not need to edit the defaults for the Hive web interface. HWI uses:

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
  <description>This is the host address the Hive Web Interface will listen on</description>
</property>

<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
  <description>This is the port the Hive Web Interface will listen on</description>
</property>

<property>
  <name>hive.hwi.war.file</name>
  <value>${HIVE_HOME}/lib/hive_hwi.war</value>
  <description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>

You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.

Start up

When Hive is initialized with no arguments, the CLI is invoked. Hive has an extension architecture used to start other Hive daemons. Jetty requires Apache Ant to start HWI, so you should define ANT_LIB as an environment variable or add it to the Hive invocation.

export ANT_LIB=/opt/ant/lib
bin/hive --service hwi


Java has no direct way of daemonizing. In a production environment you should create a wrapper script:

nohup bin/hive --service hwi > /dev/null 2> /dev/null &

If you want help on the service invocation or the list of parameters, you can run:

bin/hive --service hwi --help

Authentication

Hadoop currently uses environment properties to determine the user name and group vector. Thus Hive and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first connect to the Hive Web Interface, you are prompted for a user name and groups. This feature was added to support installations using different schedulers.

If you want to tighten up security, you will need to patch the source of the Hive Session Manager, or you may be able to tweak the JSP to accomplish this.

Accessing

In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi in your web browser.

Tips and tricks

Result file

The result file is local to the web server. A query that produces massive output should set the result file to /dev/null.

Debug Mode

Debug mode is used when the user wants the result file to contain not only the result of the Hive query but also the other messages.

Set Processor

In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by the Set Processor. In the web interface, use the form 'x=5', not 'set x=5'.
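For example, to change the number of reducers for queries run from the web interface, you would enter something like:

mapred.reduce.tasks=12

rather than set mapred.reduce.tasks=12; as you would in the CLI (the value 12 is just an illustration).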

Walk through


Authorize

(Screenshots: 1_hwi_authorize.png, 2_hwi_authorize.png)

Schema Browser

(Screenshots: 3_schema_table.png, 4_schema_browser.png)

Diagnostics

(Screenshot: 5_diagnostic.png)

Running a query

(Screenshots: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png)

Setting Up Hive Server: Setting up HiveServer2, Setting Up Thrift Hive Server, Setting Up Hive JDBC Server, Setting Up Hive ODBC Server


Hive and Amazon Web Services

Background

This document explores the different ways of leveraging Hive on Amazon Web Services - namely S3, EC2 and Elastic MapReduce.

Hadoop already has a long tradition of being run on EC2 and S3. These are well documented in the links below, which are a must-read:

Hadoop and S3


Amazon and EC2

The second document also has pointers on how to get started using EC2 and S3. For people who are new to S3, there are a few helpful notes in the S3 for n00bs section below. The rest of the documentation below assumes that the reader can launch a Hadoop cluster in EC2, copy files into and out of S3 and run some simple Hadoop jobs.

Introduction to Hive and AWS

There are three separate questions to consider when running Hive on AWS:

1. Where to run the Hive CLI from and store the metastore db (that contains table and schema definitions).

2. How to define Hive tables over existing datasets (potentially those that are already in S3).

3. How to dispatch Hive queries (which are all executed using one or more map-reduce programs) to a Hadoop cluster running in EC2.

We walk you through the choices involved here and show some practical case studies that contain detailed setup and configuration instructions.

Running the Hive CLI

The CLI takes in Hive queries, compiles them into a plan (commonly, but not always, consisting of map-reduce jobs) and then submits them to a Hadoop cluster. While it depends on Hadoop libraries for this purpose, it is otherwise relatively independent of the Hadoop cluster itself. For this reason the CLI can be run from any node that has a Hive distribution, a Hadoop distribution and a Java runtime. It can submit jobs to any compatible Hadoop cluster (whose version matches that of the Hadoop libraries that Hive is using) that it can connect to. The Hive CLI also needs to access table metadata. By default this is persisted by Hive via an embedded Derby database into a folder named metastore_db on the local file system (however, state can be persisted in any database, including remote MySQL instances).

There are two choices on where to run the Hive CLI from:

1. Run Hive CLI from within EC2 - the Hadoop master node being the obvious choice. There are several problems with this approach:

Lack of comprehensive AMIs that bundle different versions of Hive and Hadoop distributions (and the difficulty in doing so considering the large number of such combinations). Cloudera provides some AMIs that bundle Hive with Hadoop - although the choice in terms of Hive and Hadoop versions may be restricted.

Any required map-reduce scripts may also need to be copied to the master/Hive node.

If the default Derby database is used, then one has to think about persisting state beyond the lifetime of one Hadoop cluster. S3 is an obvious choice, but the user must restore and back up Hive metadata at the launch and termination of the Hadoop cluster.

2. Run the Hive CLI remotely from outside EC2. In this case, the user installs a Hive distribution on a personal workstation. The main trick with this option is connecting to the Hadoop cluster, both for submitting jobs and for reading and writing files to HDFS. The section on Running jobs from a remote machine details how this can be done. [Case Study 1] goes into the setup for this in more detail. This option solves the problems mentioned above:


Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation, launch a Hadoop cluster with the desired Hadoop version etc. on EC2 and start running queries.

Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache at job submission time and do not need to be copied to the Hadoop machines.

Hive Metadata can be stored on local disk painlessly.

However - the one downside of Option 2 is that jar files are copied over to the Hadoop cluster for each map-reduce job. This can cause high latency in job submission as well as incur some AWS network transmission costs. Option 1 seems suitable for advanced users who have figured out a stable Hadoop and Hive (and potentially external libraries) configuration that works for them and can create a new AMI with the same.

Loading Data into Hive Tables

It is useful to go over the main storage choices for the Hadoop/EC2 environment:

S3 is an excellent place to store data for the long term. There are a couple of choices on how S3 can be used:

Data can be stored as files within S3 using tools like aws and s3curl, as detailed in the S3 for n00bs section. This suffers from the restriction of the 5G limit on file size in S3. But the nice thing is that there are probably scores of tools that can help in copying/replicating data to S3 in this manner. Hadoop is able to read/write such files using the S3N filesystem.

Alternatively Hadoop provides a block based file system using S3 as a backing store. This does not suffer from the 5G max file size restriction. However - Hadoop utilities and libraries must be used for reading/writing such files.

HDFS instance on the local drives of the machines in the Hadoop cluster. The lifetime of this is restricted to that of the Hadoop instance - hence this is not suitable for long lived data. However it should provide data that can be accessed much faster and hence is a good choice for intermediate/tmp data.

Considering these factors, the following makes sense in terms of Hive tables:

1. For long-lived tables, use S3-based storage mechanisms.
2. For intermediate data and tmp tables, use HDFS.

[Case Study 1] shows you how to achieve such an arrangement using the S3N filesystem.
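For example, a long-lived table over data already sitting in S3 might be declared roughly as follows; the bucket, path and schema are invented for this sketch, and the AWS credentials are assumed to be configured for the S3N filesystem:

-- example only: the bucket, path and columns are placeholders
CREATE EXTERNAL TABLE web_logs (
  ip STRING,
  request STRING,
  request_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-bucket/web_logs/';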

If the user is running the Hive CLI from their personal workstation, they can also use Hive's 'load data local' commands as a convenient alternative (to dfs commands) to copy data from their local filesystems (accessible from their workstation) into tables defined over either HDFS or S3.
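As a sketch (the local file path and table name are made up), loading a local file into such a table looks like:

-- example only: the file path and table name are placeholders
LOAD DATA LOCAL INPATH '/home/me/logs/2009-08-01.log' INTO TABLE web_logs;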

Submitting jobs to a Hadoop cluster

This applies particularly when the Hive CLI is run remotely. A single Hive CLI session can switch across different Hadoop clusters (especially as clusters are brought up and terminated). Only two configuration variables:

fs.default.name

mapred.job.tracker

need to be changed to point the CLI from one Hadoop cluster to another. Beware though that tables stored in a previous HDFS instance will not be accessible as the CLI switches from one cluster to another. Again, more details can be found in [Case Study 1].
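As a sketch, repointing a running CLI session at a newly launched cluster could look like the following; the host name and ports are placeholders for your EC2 master node:

-- placeholders: substitute your cluster's NameNode and JobTracker host:port
set fs.default.name=hdfs://ec2-master-host:9000;
set mapred.job.tracker=ec2-master-host:9001;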

Case Studies

1. [Querying files in S3 using EC2, Hive and Hadoop]

Appendix

S3 for n00bs

One of the useful things to understand is how S3 is normally used as a file system. Each S3 bucket can be considered as the root of a file system. Different files within this filesystem become objects stored in S3, where the path name of the file (path components joined with '/') becomes the S3 key within the bucket and the file contents become the value. Different tools like S3Fox (https://addons.mozilla.org/en-US/firefox/addon/3247) and the native S3 FileSystem in Hadoop (s3n) show a directory structure that's implied by the common prefixes found in the keys. Not all tools are able to create an empty directory. In particular, S3Fox does (by creating an empty key representing the directory). Other popular tools like aws, s3cmd and s3curl provide convenient ways of accessing S3 from the command line, but don't have the capability of creating empty directories.
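As a quick illustration (the bucket and file names are invented), once the AWS credentials are configured for the s3n filesystem (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey), files can be copied to and listed from S3 with the usual Hadoop commands:

# example only: my-bucket and the file names are placeholders
hadoop fs -put access.log s3n://my-bucket/web_logs/access.log
hadoop fs -ls s3n://my-bucket/web_logs/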

Amazon Elastic MapReduce and Hive

Amazon Elastic MapReduce is a web service that makes it easy to launch managed, resizable Hadoop clusters on the web-scale infrastructure of Amazon Web Services (AWS). Elastic MapReduce makes it easy for you to launch a Hive and Hadoop cluster, provides you with flexibility to choose different cluster sizes, and allows you to tear them down automatically when processing has completed. You pay only for the resources that you use with no minimums or long-term commitments.

Amazon Elastic MapReduce simplifies the use of Hive clusters by:

1. Handling the provisioning of Hadoop clusters of up to thousands of EC2 instances
2. Installing Hadoop across the master and slave nodes of your cluster and configuring Hadoop based on your chosen hardware
3. Installing Hive on the master node of your cluster and configuring it for communication with the Hadoop JobTracker and NameNode
4. Providing a simple API, a web UI, and purpose-built tools for managing, monitoring, and debugging Hadoop tasks throughout the life of the cluster
5. Providing deep integration, and optimized performance, with AWS services such as S3 and EC2 and AWS features such as Spot Instances, Elastic IPs, and Identity and Access Management (IAM)

Please refer to the following link to view the Amazon Elastic MapReduce Getting Started Guide:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Amazon Elastic MapReduce provides you with multiple clients to run your Hive cluster. You can launch a Hive cluster using the AWS Management Console, the Amazon Elastic MapReduce Ruby Client, or the AWS Java SDK. You may also install and run multiple versions of Hive on the same cluster, allowing you to benchmark a newer Hive version alongside your previous version. You can also install a newer Hive version directly onto an existing Hive cluster.

Supported versions:

Hadoop Version | Hive Version
0.18 | 0.4
0.20 | 0.5, 0.7, 0.7.1

Hive Defaults

Thrift Communication port

Hive Version | Thrift Port
0.4 | 10000
0.5 | 10000
0.7 | 10001
0.7.1 | 10002

Log File

Hive Version | Log Location
0.4 | /mnt/var/log/apps/hive.log
0.5 | /mnt/var/log/apps/hive_05.log
0.7 | /mnt/var/log/apps/hive_07.log
0.7.1 | /mnt/var/log/apps/hive_07_1.log

MetaStore

By default, Amazon Elastic MapReduce uses MySQL, preinstalled on the Master Node, for its Hive metastore. Alternatively, you can use the Amazon Relational Database Service (Amazon RDS) to ensure the metastore is persisted beyond the life of your cluster. This also allows you to share the metastore between multiple Hive clusters. Simply override the default location of the MySQL database to the external persistent storage location.

Hive CLI

EMR configures the master node to allow SSH access. You can log onto the master node and execute Hive commands using the Hive CLI. If you have multiple versions of Hive installed on the cluster you can access each one of them via a separate command:

Hive Version | Hive Command
0.4 | hive
0.5 | hive-0.5
0.7 | hive-0.7
0.7.1 | hive-0.7.1

EMR sets up a separate Hive metastore and Hive warehouse for each installed Hive version on a given cluster. Hence, creating tables using one version does not interfere with the tables created using another installed version. Please note that if you point multiple Hive tables to the same location, updates to one table become visible to the other tables.

Hive Server


EMR runs a Thrift Hive server on the master node of the Hive cluster. It can be accessed using any JDBC client (for example, SQuirreL SQL) via the Hive JDBC drivers. The JDBC drivers for different Hive versions can be downloaded via the following links:

Hive Version | Hive JDBC Driver
0.5 | http://aws.amazon.com/developertools/0196055244487017
0.7 | http://aws.amazon.com/developertools/1818074809286277
0.7.1 | http://aws.amazon.com/developertools/8084613472207189

Here is the process to connect to the Hive Server using a JDBC driver:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html#HiveJDBCDriver

Running Batch Queries

You can also submit queries from the command line client remotely. Please note that currently there is a limit of 256 steps on each cluster. If you have more than 256 steps to execute, it is recommended that you run the queries directly using the Hive CLI or submit queries via a JDBC driver.

Hive S3 Tables

An Elastic MapReduce Hive cluster comes configured for communication with S3. You can create tables and point them to your S3 location and Hive and Hadoop will communicate with S3 automatically using your provided credentials.

Once you have moved data to an S3 bucket, you simply point your table to that location in S3 in order to read or process data via Hive. You can also create partitioned tables in S3. Hive on Elastic MapReduce provides support for dynamic partitioning in S3.
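As a sketch of a partitioned table over S3 (the bucket, table and partition values are made up for the example), this might look like:

-- example only: the bucket, table and partition values are placeholders
CREATE EXTERNAL TABLE impressions (ad_id STRING, clicks INT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/impressions/';

ALTER TABLE impressions ADD PARTITION (dt='2011-01-01')
LOCATION 's3://my-bucket/impressions/dt=2011-01-01/';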

Hive Logs

Hive application logs: All Hive application logs are redirected to the /mnt/var/log/apps/ directory.

Hadoop daemon logs: Hadoop daemon logs are available in the /mnt/var/log/hadoop/ folder.

Hadoop task attempt logs are available in the /mnt/var/log/hadoop/userlogs/ folder on each slave node in the cluster.


Tutorials

The following Hive tutorials are available for you to get started with Hive on Elastic MapReduce:

1. Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce: http://aws.amazon.com/articles/Elastic-MapReduce/5249664154115844

2. Contextual Advertising using Apache Hive and Amazon Elastic MapReduce with High Performance Computing instances: http://aws.amazon.com/articles/Elastic-MapReduce/2855

3. Operating a Data Warehouse with Hive, Amazon Elastic MapReduce and Amazon SimpleDB: http://aws.amazon.com/articles/Elastic-MapReduce/2854

4. Running Hive on Amazon Elastic MapReduce: http://aws.amazon.com/articles/2857

In addition, Amazon provides step-by-step video tutorials:

http://aws.amazon.com/articles/2862

Support

You can ask questions related to Hive on Elastic MapReduce on the Elastic MapReduce forums at:

https://forums.aws.amazon.com/forum.jspa?forumID=52

Please also refer to the EMR developer guide for more information:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/

Contributed by: Vaibhav Aggarwal