2012-08-11 Query Optimization Using Column Statistics in Hive
TRANSCRIPT
IPL Seminar
Query Optimization
Using Column Statistics in Hive
Nguyen Minh Quy
Ha Noi University of Science and Technology
IPL Camp August 11th, 2012
Brief introduction to Hadoop & Hive
HADOOP
Hadoop is a framework for running applications on large clusters built of commodity hardware (Hadoop wiki).
Hadoop contains the MapReduce framework for parallel programming and HDFS for storing distributed data.
It requires programming skills to work with (not easy for everyone).
Brief introduction to Hadoop & Hive
HIVE
Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
Hive is built on top of Hadoop.
Hive has an SQL-like language, called HiveQL, for manipulating data.
It's very easy to work with.
Brief introduction to Hadoop & Hive
Hive architecture
[Diagram: the HIVE FRAMEWORK sits on top of the HADOOP FRAMEWORK]
Brief introduction to Hadoop & Hive
One of Hive's major drawbacks
Joining tables in Hive can take a long time to complete, especially when we join big tables.
Therefore, more research is needed on improving join performance.
The paper I will talk about today is one example.
The paper I will talk about today is an example
Query Optimization Using Column Statistics in Hive
ACM 2011, September 21-23
Anja Gruenheid,
Edward Omiecinski, Leo Mark
Georgia Institute of Technology
1. What is the Paper's Idea?
We assume we have 3 tables, Orders, Customers and LineItems, related as follows.
The numbers of records (tuples) in the tables are 1,500, 150, and 6,000 respectively.
Orders (OrderID, CustomerID, …): 1,500 records
Customers (CustomerID, …): 150 records
LineItems (ProductID, OrderID, …): 6,000 records
1. What is the Paper's Idea?
Sometimes we need to join the tables like this:
…Customers ⋈ Orders ⋈ LineItems…
Here, we have 2 orders/choices in which to perform the join:

Join order 1: (Customers ⋈ Orders) ⋈ LineItems
Customers (150) ⋈ Orders (1,500) first, then the result ⋈ LineItems (6,000)
Cost1 = 150*1,500 + 150*1,500*6,000 = 1,350,225,000

Join order 2: (Orders ⋈ LineItems) ⋈ Customers
Orders (1,500) ⋈ LineItems (6,000) first, then the result ⋈ Customers (150)
Cost2 = 1,500*6,000 + 1,500*6,000*150 = 1,359,000,000

Cost1 < Cost2 means:
Join order 1 is better than join order 2.
The join order can improve performance! (Paper's idea)
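The slide's arithmetic can be sketched in a few lines. This is only an illustration of the cost figures above, assuming (as the slides do) that the cost of a two-way join is the product of the input sizes and that the join result is estimated to have that same cardinality:

```python
# Cardinalities from the example: Customers = 150, Orders = 1,500, LineItems = 6,000.
CUSTOMERS, ORDERS, LINEITEMS = 150, 1_500, 6_000

def join_cost(left, right):
    """Estimated cost of joining two inputs: the product of their sizes.
    The join result is assumed to have this same cardinality."""
    return left * right

# Join order 1: (Customers ⋈ Orders) ⋈ LineItems
inner1 = join_cost(CUSTOMERS, ORDERS)        # 150 * 1,500 = 225,000
cost1 = inner1 + join_cost(inner1, LINEITEMS)

# Join order 2: (Orders ⋈ LineItems) ⋈ Customers
inner2 = join_cost(ORDERS, LINEITEMS)        # 1,500 * 6,000 = 9,000,000
cost2 = inner2 + join_cost(inner2, CUSTOMERS)

print(cost1)  # 1350225000
print(cost2)  # 1359000000
print("Join order 1 is cheaper" if cost1 < cost2 else "Join order 2 is cheaper")
```

Running it reproduces the two totals from the slide, with join order 1 winning.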
2. What is the Purpose of the Paper?
Paper's purpose: Given a join, find the join order that yields the best (lowest) cost.
3. How?
To find the join order with the best cost, the authors use a cost estimation function to make the decision:
A join order is chosen if the cost estimation function returns the best cost (lowest value) for it.
3. How?
There are 2 approaches to building the estimation function:
Classical query optimization, denoted CA
MapReduce-Adapted query optimization, denoted MRA
Classical query optimization
The cost estimation function (as used in the examples below) is:
C_CA(T1 ⋈ T2) = C_CA(T1) + C_CA(T2) + |T1| * |T2|
|T| denotes the cardinality of T; the result T1 ⋈ T2 is estimated to have cardinality |T1| * |T2|.
T1 & T2: the left side and right side of the join.
Classical query optimization
Returning to the earlier example with 2 join orders:
1. T = (Customers ⋈ Orders) ⋈ LineItems
C_CA(T) = 1,350,225,000
2. T = (Orders ⋈ LineItems) ⋈ Customers
C_CA(T) = 1,359,000,000
The first join order is preferred/selected.
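The classical cost function can be written as a short recursive sketch over a join tree. This is an assumption-laden illustration, not the paper's implementation: a join tree is either an int (a base-table cardinality) or a (left, right) pair, and the result cardinality of a join is estimated as the product of its inputs' cardinalities, matching the slide's numbers:

```python
def cardinality(tree):
    """Estimated cardinality of a join tree: base tables are ints,
    joins are (left, right) pairs estimated as the product of the sides."""
    if isinstance(tree, int):
        return tree
    left, right = tree
    return cardinality(left) * cardinality(right)

def c_ca(tree):
    """Classical cost: C_CA(T1 ⋈ T2) = C_CA(T1) + C_CA(T2) + |T1|*|T2|.
    Scanning a base table adds no join cost in this model."""
    if isinstance(tree, int):
        return 0
    left, right = tree
    return c_ca(left) + c_ca(right) + cardinality(left) * cardinality(right)

customers, orders, lineitems = 150, 1_500, 6_000
print(c_ca(((customers, orders), lineitems)))  # 1350225000
print(c_ca(((orders, lineitems), customers)))  # 1359000000
```

The two printed values match C_CA for the two join orders on the slide.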
MapReduce-Adapted query optimization
Because Hadoop always executes a Map and a Reduce phase for each join operation, there will be many Map/Reduce iterations as the number of tables increases.
The time taken by I/O operations (Map = read; Reduce = write) is very large because of the big data.
We need to re-calculate the cost function:
C_MRA(T) = the sum of the estimated sizes of the intermediate join results (the final result is not counted, since every join order must produce it).
MapReduce-Adapted query optimization
Returning to the earlier example with 2 join orders:
1. T = (Customers ⋈ Orders) ⋈ LineItems
C_MRA(T) = 150 * 1,500 = 225,000
2. T = (Orders ⋈ LineItems) ⋈ Customers
C_MRA(T) = 1,500 * 6,000 = 9,000,000
The first join order is preferred/selected again.
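A matching sketch of the MapReduce-Adapted cost, under the same assumptions as before (join trees as nested pairs of base-table cardinalities, join results estimated as products). Here only the intermediate results are summed, since those are what must be written by one MapReduce job and read by the next:

```python
def cardinality(tree):
    """Estimated cardinality: base tables are ints, joins are (left, right)
    pairs estimated as the product of the sides."""
    if isinstance(tree, int):
        return tree
    left, right = tree
    return cardinality(left) * cardinality(right)

def c_mra(tree, is_root=True):
    """MapReduce-Adapted cost: sum the estimated sizes of intermediate
    join results. The root (final) result is not counted, since every
    join order has to produce it anyway."""
    if isinstance(tree, int):
        return 0
    left, right = tree
    own = 0 if is_root else cardinality(tree)
    return own + c_mra(left, False) + c_mra(right, False)

customers, orders, lineitems = 150, 1_500, 6_000
print(c_mra(((customers, orders), lineitems)))  # 225000
print(c_mra(((orders, lineitems), customers)))  # 9000000
```

The printed values reproduce the slide's C_MRA figures, again preferring the first join order.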
MapReduce-Adapted vs Classical query optimization
Classical approach: uses the intermediate cost of the previous joins as additional cost when joining two subtrees.
MapReduce-Adapted approach: focuses on the number of tuples generated during the join and on the I/O between a reduce phase and the following map phase.
MapReduce-Adapted vs Classical query optimization
The authors showed that the MapReduce-Adapted approach is better than the Classical approach.
Another example (4 tables and 3 joins), with the algebraic structure as follows (not optimized):
MapReduce-Adapted vs Classical query optimization
After optimization with the Classical approach, the new join order becomes:
MapReduce-Adapted vs Classical query optimization
But if we apply the MapReduce-Adapted approach, we get another order:
MapReduce-Adapted vs Classical
query optimization
Execution result:
Query optimization using Hive column statistics
The approaches mentioned before all use the cardinality of a table (its size) in their cost functions.
In some cases, based on column statistics, we can use another value instead of the table size, which improves performance.
What are column statistics?
Column statistics: useful statistical information generated about the columns of a table, for example: the max/min value of the column; the highest-frequency value; the number of distinct values; …
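As a toy illustration of the statistics just listed, here is how they could be computed over one in-memory column (the data is made up for the example; it is not one of the paper's tables):

```python
from collections import Counter

# A hypothetical column of values.
col1 = [3, 7, 3, 9, 3, 1, 7, 9, 9, 3]

stats = {
    "min": min(col1),                                    # smallest value
    "max": max(col1),                                    # largest value
    "num_distinct": len(set(col1)),                      # number of distinct values
    "highest_frequency": Counter(col1).most_common(1)[0][1],  # count of the most common value
}
print(stats)  # {'min': 1, 'max': 9, 'num_distinct': 4, 'highest_frequency': 4}
```

The "highest_frequency" entry is the statistic the next slide relies on: it bounds how many rows an equality predicate on the column can return.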
Why can column statistics help improve performance?
Let's consider this case:
SELECT ……..
FROM Customers ⋈ Orders ⋈ LineItems
WHERE LineItems.Col1 = V
We assume the column statistics show that the highest frequency in Col1 is very small (for example, 100). Therefore, the maximum number of rows returned from LineItems is 100.
Clearly, this number is far smaller than |LineItems| (6,000).
Why can column statistics help improve performance?
The cost function now indicates that join order #2 is better than join order #1. Indeed:

Join order 1: (Customers ⋈ Orders) ⋈ LineItems
Customers (150) ⋈ Orders (1,500) first, then the result ⋈ the filtered LineItems (100 rows)
Cost1 = 150*1,500 + 150*1,500*100 = 22,725,000

Join order 2: (Orders ⋈ LineItems) ⋈ Customers
Orders (1,500) ⋈ the filtered LineItems (100 rows) first, then the result ⋈ Customers (150)
Cost2 = 1,500*100 + 1,500*100*150 = 22,650,000

Cost2 < Cost1 now means:
Join order 2 is better than join order 1.
Join order 2 is now chosen instead of join order 1.
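The flipped decision can be checked with the same arithmetic sketch as before, substituting the column-statistic bound (at most 100 matching rows) for |LineItems| = 6,000:

```python
# The predicate LineItems.Col1 = V returns at most 100 rows, because 100 is
# the highest frequency of any value in Col1; 100 replaces |LineItems| = 6,000.
CUSTOMERS, ORDERS, FILTERED_LINEITEMS = 150, 1_500, 100

# Join order 1: (Customers ⋈ Orders) ⋈ filtered LineItems
cost1 = CUSTOMERS * ORDERS + CUSTOMERS * ORDERS * FILTERED_LINEITEMS

# Join order 2: (Orders ⋈ filtered LineItems) ⋈ Customers
cost2 = ORDERS * FILTERED_LINEITEMS + ORDERS * FILTERED_LINEITEMS * CUSTOMERS

print(cost1)  # 22725000
print(cost2)  # 22650000
print("Join order 2 is cheaper" if cost2 < cost1 else "Join order 1 is cheaper")
```

With the statistic applied, join order 2 now has the lower estimate, reversing the earlier choice.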
Thank you!