2012-08-11 Query Optimization Using Column Statistics in Hive
TRANSCRIPT
IPL Seminar
Query Optimization
Using Column Statistics in Hive
Nguyen Minh Quy
Ha Noi University of Science and Technology
IPL Camp August 11th, 2012
Brief introduction to Hadoop & Hive
HADOOP
Hadoop is a framework for running applications on large clusters built of commodity hardware (Hadoop wiki).
Hadoop contains the MapReduce framework for parallel programming and HDFS for storing distributed data.
It requires programming skills to work with (not easy for everyone).
Brief introduction to Hadoop & Hive
HIVE
Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
Hive is built on top of Hadoop.
Hive has an SQL-like language, called HiveQL, for manipulating data.
It's very easy to work with.
Brief introduction to Hadoop & Hive
Hive architecture
[Diagram: the HIVE FRAMEWORK sits on top of the HADOOP FRAMEWORK]
Brief introduction to Hadoop & Hive
One of Hive's major drawbacks
Joining tables in Hive can take a long time to complete, especially when we join big tables.
Therefore, more research is needed on improving join performance.
The paper I will talk about today is one example.
The paper I will talk about today is an example
Query Optimization Using Column Statistics in Hive
ACM 2011, September 21-23
Anja Gruenheid,
Edward Omiecinski, Leo Mark
Georgia Institute of Technology
1. What is the Paper's Idea?
We assume we have 3 tables, Orders, Customers and LineItems, related as follows.
The numbers of records (tuples) in the tables are 1,500, 150, and 6,000 respectively.
Orders (OrderID, CustomerID, …): 1,500 records
Customers (CustomerID, …): 150 records
LineItems (ProductID, OrderID, …): 6,000 records
1. What is the Paper's Idea?
Sometimes we need to join the tables like this:
…Customers ⋈ Orders ⋈ LineItems…
Here, we have 2 orders/choices in which to perform the join:

Join order 1: (Customers ⋈ Orders) ⋈ LineItems
Customers (150) ⋈ Orders (1,500) first, then the result ⋈ LineItems (6,000)
Cost1 = 150*1,500 + 150*1,500*6,000 = 1,350,225,000

Join order 2: (Orders ⋈ LineItems) ⋈ Customers
Orders (1,500) ⋈ LineItems (6,000) first, then the result ⋈ Customers (150)
Cost2 = 1,500*6,000 + 1,500*6,000*150 = 1,359,000,000

Cost1 < Cost2 means:
Join order 1 is better than join order 2.
The join order can improve performance! (Paper's idea)
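The slide's arithmetic can be sketched in a few lines. This is only an illustration of the cost figures above, assuming (as the slides do) that the cost of a two-way join is the product of the input sizes and that the join result is estimated to have that same cardinality:

```python
# Cardinalities from the example: Customers = 150, Orders = 1,500, LineItems = 6,000.
CUSTOMERS, ORDERS, LINEITEMS = 150, 1_500, 6_000

def join_cost(left, right):
    """Estimated cost of joining two inputs: the product of their sizes.
    The join result is assumed to have this same cardinality."""
    return left * right

# Join order 1: (Customers ⋈ Orders) ⋈ LineItems
inner1 = join_cost(CUSTOMERS, ORDERS)        # 150 * 1,500 = 225,000
cost1 = inner1 + join_cost(inner1, LINEITEMS)

# Join order 2: (Orders ⋈ LineItems) ⋈ Customers
inner2 = join_cost(ORDERS, LINEITEMS)        # 1,500 * 6,000 = 9,000,000
cost2 = inner2 + join_cost(inner2, CUSTOMERS)

print(cost1)  # 1350225000
print(cost2)  # 1359000000
print("Join order 1 is cheaper" if cost1 < cost2 else "Join order 2 is cheaper")
```

Running it reproduces the two totals from the slide, with join order 1 winning.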
2. What is the Purpose of the Paper?
Paper's purpose: Given a join, find the join order that yields the best (lowest) cost.
3. How?
To find the join order with the best cost, the authors use a cost estimation function to make the decision:
A join order is chosen if the cost estimation function returns the best cost (lowest value) for it.
3. How?
There are 2 approaches to building the estimation function:
Classical query optimization, denoted CA
MapReduce-Adapted query optimization, denoted MRA
Classical query optimization
The cost estimation function (as used in the examples below) is:
C_CA(T1 ⋈ T2) = C_CA(T1) + C_CA(T2) + |T1| * |T2|
|T| denotes the cardinality of T; the result T1 ⋈ T2 is estimated to have cardinality |T1| * |T2|.
T1 & T2: the left side and right side of the join.
Classical query optimization
Returning to the earlier example with 2 join orders:
1. T = (Customers ⋈ Orders) ⋈ LineItems
C_CA(T) = 1,350,225,000
2. T = (Orders ⋈ LineItems) ⋈ Customers
C_CA(T) = 1,359,000,000
The first join order is preferred/selected.
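The classical cost function can be written as a short recursive sketch over a join tree. This is an assumption-laden illustration, not the paper's implementation: a join tree is either an int (a base-table cardinality) or a (left, right) pair, and the result cardinality of a join is estimated as the product of its inputs' cardinalities, matching the slide's numbers:

```python
def cardinality(tree):
    """Estimated cardinality of a join tree: base tables are ints,
    joins are (left, right) pairs estimated as the product of the sides."""
    if isinstance(tree, int):
        return tree
    left, right = tree
    return cardinality(left) * cardinality(right)

def c_ca(tree):
    """Classical cost: C_CA(T1 ⋈ T2) = C_CA(T1) + C_CA(T2) + |T1|*|T2|.
    Scanning a base table adds no join cost in this model."""
    if isinstance(tree, int):
        return 0
    left, right = tree
    return c_ca(left) + c_ca(right) + cardinality(left) * cardinality(right)

customers, orders, lineitems = 150, 1_500, 6_000
print(c_ca(((customers, orders), lineitems)))  # 1350225000
print(c_ca(((orders, lineitems), customers)))  # 1359000000
```

The two printed values match C_CA for the two join orders on the slide.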
MapReduce-Adapted query optimization
Because Hadoop always executes a Map and a Reduce phase for each join operation, there will be many Map/Reduce iterations as the number of tables increases.
The time taken by I/O operations (Map = read; Reduce = write) is very large because of the big data.
We need to re-calculate the cost function:
C_MRA(T) = the sum of the estimated sizes of the intermediate join results (the final result is not counted, since every join order must produce it).
MapReduce-Adapted query optimization
Returning to the earlier example with 2 join orders:
1. T = (Customers ⋈ Orders) ⋈ LineItems
C_MRA(T) = 150 * 1,500 = 225,000
2. T = (Orders ⋈ LineItems) ⋈ Customers
C_MRA(T) = 1,500 * 6,000 = 9,000,000
The first join order is preferred/selected again.
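A matching sketch of the MapReduce-Adapted cost, under the same assumptions as before (join trees as nested pairs of base-table cardinalities, join results estimated as products). Here only the intermediate results are summed, since those are what must be written by one MapReduce job and read by the next:

```python
def cardinality(tree):
    """Estimated cardinality: base tables are ints, joins are (left, right)
    pairs estimated as the product of the sides."""
    if isinstance(tree, int):
        return tree
    left, right = tree
    return cardinality(left) * cardinality(right)

def c_mra(tree, is_root=True):
    """MapReduce-Adapted cost: sum the estimated sizes of intermediate
    join results. The root (final) result is not counted, since every
    join order has to produce it anyway."""
    if isinstance(tree, int):
        return 0
    left, right = tree
    own = 0 if is_root else cardinality(tree)
    return own + c_mra(left, False) + c_mra(right, False)

customers, orders, lineitems = 150, 1_500, 6_000
print(c_mra(((customers, orders), lineitems)))  # 225000
print(c_mra(((orders, lineitems), customers)))  # 9000000
```

The printed values reproduce the slide's C_MRA figures, again preferring the first join order.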
MapReduce-Adapted vs Classical query optimization
Classical approach: uses the intermediate cost of the previous joins as additional cost when joining two subtrees.
MapReduce-Adapted approach: focuses on the number of tuples generated during the join and on the I/O between a reduce phase and the following map phase.
MapReduce-Adapted vs Classical query optimization
The authors showed that the MapReduce-Adapted approach is better than the Classical approach.
Another example (4 tables and 3 joins), with the algebraic structure as follows (not optimized):
MapReduce-Adapted vs Classical query optimization
After optimization with the Classical approach, the new join order becomes:
MapReduce-Adapted vs Classical query optimization
But if we apply the MapReduce-Adapted approach, we get another order:
MapReduce-Adapted vs Classical
query optimization
Execution result:
Query optimization using Hive column statistics
The approaches mentioned before all use the cardinality of a table (its size) in their cost functions.
In some cases, based on column statistics, we can use another value instead of the table size, which improves performance.
What are column statistics?
Column statistics: useful statistical information generated about the columns of a table, for example: the max/min value of the column; the highest-frequency value; the number of distinct values; …
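As a toy illustration of the statistics just listed, here is how they could be computed over one in-memory column (the data is made up for the example; it is not one of the paper's tables):

```python
from collections import Counter

# A hypothetical column of values.
col1 = [3, 7, 3, 9, 3, 1, 7, 9, 9, 3]

stats = {
    "min": min(col1),                                    # smallest value
    "max": max(col1),                                    # largest value
    "num_distinct": len(set(col1)),                      # number of distinct values
    "highest_frequency": Counter(col1).most_common(1)[0][1],  # count of the most common value
}
print(stats)  # {'min': 1, 'max': 9, 'num_distinct': 4, 'highest_frequency': 4}
```

The "highest_frequency" entry is the statistic the next slide relies on: it bounds how many rows an equality predicate on the column can return.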
Why can column statistics help improve performance?
Let's consider this case:
SELECT ……..
FROM Customers ⋈ Orders ⋈ LineItems
WHERE LineItems.Col1 = V
We assume the column statistics show that the highest frequency in Col1 is very small (for example, 100). Therefore, the maximum number of rows returned from LineItems is 100.
Clearly, this number is far smaller than |LineItems| (6,000).
Why can column statistics help improve performance?
The cost function now indicates that join order #2 is better than join order #1. Indeed:

Join order 1: (Customers ⋈ Orders) ⋈ LineItems
Customers (150) ⋈ Orders (1,500) first, then the result ⋈ the filtered LineItems (100 rows)
Cost1 = 150*1,500 + 150*1,500*100 = 22,725,000

Join order 2: (Orders ⋈ LineItems) ⋈ Customers
Orders (1,500) ⋈ the filtered LineItems (100 rows) first, then the result ⋈ Customers (150)
Cost2 = 1,500*100 + 1,500*100*150 = 22,650,000

Cost2 < Cost1 now means:
Join order 2 is better than join order 1.
Join order 2 is now chosen instead of join order 1.
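The flipped decision can be checked with the same arithmetic sketch as before, substituting the column-statistic bound (at most 100 matching rows) for |LineItems| = 6,000:

```python
# The predicate LineItems.Col1 = V returns at most 100 rows, because 100 is
# the highest frequency of any value in Col1; 100 replaces |LineItems| = 6,000.
CUSTOMERS, ORDERS, FILTERED_LINEITEMS = 150, 1_500, 100

# Join order 1: (Customers ⋈ Orders) ⋈ filtered LineItems
cost1 = CUSTOMERS * ORDERS + CUSTOMERS * ORDERS * FILTERED_LINEITEMS

# Join order 2: (Orders ⋈ filtered LineItems) ⋈ Customers
cost2 = ORDERS * FILTERED_LINEITEMS + ORDERS * FILTERED_LINEITEMS * CUSTOMERS

print(cost1)  # 22725000
print(cost2)  # 22650000
print("Join order 2 is cheaper" if cost2 < cost1 else "Join order 1 is cheaper")
```

With the statistic applied, join order 2 now has the lower estimate, reversing the earlier choice.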
Thank you!