2012-08-11 Query Optimization Using Column Statistics in Hive

IPL Seminar: Query Optimization Using Column Statistics in Hive. Nguyen Minh Quy, Ha Noi University of Science and Technology. IPL Camp, August 11th, 2012

Upload: quynm

Post on 13-Jul-2015


Page 1

IPL Seminar

Query Optimization

Using Column Statistics in Hive

Nguyen Minh Quy

Ha Noi University of Science and Technology

IPL Camp August 11th, 2012

Page 2

Brief introduction to Hadoop & Hive

HADOOP

Hadoop is a framework for running applications on large clusters built of commodity hardware (Hadoop wiki).

Hadoop contains the MapReduce framework for parallel programming and HDFS for distributed data storage.

It requires programming skills to work with (not easy for everyone).

Page 3

Brief introduction to Hadoop & Hive

HIVE

Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

Hive is built on top of Hadoop.

Hive has a SQL-like language, called HiveQL, for manipulating data.

It’s very easy to work with.

Page 4

Brief introduction to Hadoop & Hive

Hive architecture

[Figure: architecture diagram with two labeled parts, HADOOP FRAMEWORK and HIVE FRAMEWORK]

Page 5

Brief introduction to Hadoop & Hive

One of Hive’s major drawbacks

Joining tables in Hive can take a long time, especially when the tables are big.

Therefore, more research on improving join performance is needed.

The paper I will talk about today is an example.

Page 6

Query Optimization Using Column Statistics in Hive

ACM 2011, September 21-23

Anja Gruenheid, Edward Omiecinski, Leo Mark

Georgia Institute of Technology

Page 7

1. What is the Paper’s Idea?

Assume we have 3 tables, Customers, Orders and LineItems, related as follows. The numbers of records (tuples) in the tables are 150, 1,500 and 6,000 respectively.

Customers (150 tuples): CustomerID, …….

Orders (1,500 tuples): OrderID, CustomerID, ……

LineItems (6,000 tuples): ProductID, OrderID, ………

Page 8

1. What is the Paper’s Idea?

Sometimes we need to join the tables like this:

… Customers ⋈ Orders ⋈ LineItems …

Here, we have 2 possible join orders:

Join order 1, first join: Customers (150) ⋈ Orders (1,500)

Cost1 = 150*1500

Page 9

1. What is the Paper’s Idea?

Sometimes we need to join the tables like this:

… Customers ⋈ Orders ⋈ LineItems …

Here, we have 2 possible join orders:

Join order 1: (Customers (150) ⋈ Orders (1,500)) ⋈ LineItems (6,000)

Cost1 = 150*1500 + 150*1500*6000 = 1,350,225,000

Page 10

1. What is the Paper’s Idea?

Sometimes we need to join the tables like this:

… Customers ⋈ Orders ⋈ LineItems …

Here, we have 2 possible join orders:

Join order 2, first join: Orders (1,500) ⋈ LineItems (6,000)

Cost2 = 1500*6000

Page 11

1. What is the Paper’s Idea?

Sometimes we need to join the tables like this:

… Customers ⋈ Orders ⋈ LineItems …

Here, we have 2 possible join orders:

Join order 2: (Orders (1,500) ⋈ LineItems (6,000)) ⋈ Customers (150)

Cost1 = 150*1500 + 150*1500*6000 = 1,350,225,000

Cost2 = 1500*6000 + 1500*6000*150 = 1,359,000,000

Page 12

1. What is the Paper’s Idea?

Sometimes we need to join the tables like this:

Customers ⋈ Orders ⋈ LineItems

Here, we have 2 possible join orders:

Join order 1: (Customers (150) ⋈ Orders (1,500)) ⋈ LineItems (6,000)

Cost1 = 150*1500 + 150*1500*6000 = 1,350,225,000

Join order 2: (Orders (1,500) ⋈ LineItems (6,000)) ⋈ Customers (150)

Cost2 = 1500*6000 + 1500*6000*150 = 1,359,000,000

Cost1 < Cost2 means:

Order 1 is better than Order 2.

The join order can improve performance! (Paper’s idea)

Page 13

2. What is the purpose of the paper?

Paper’s purpose: given a join, find the join order with the best cost.

Page 14

3. How?

To find the join order with the best cost, the authors use a cost estimation function to make the decision:

A join order is chosen if the cost estimation function returns the best cost (lowest value) for it.
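The selection rule above can be sketched in Python. This is a minimal illustration, not the paper's implementation: `pick_best_order` is a hypothetical helper, and the cost table simply holds the two estimates quoted on the earlier slides.

```python
from typing import Callable, Iterable, Tuple

JoinOrder = Tuple[str, ...]  # tables listed in the order they are joined


def pick_best_order(
    candidates: Iterable[JoinOrder],
    estimate_cost: Callable[[JoinOrder], int],
) -> JoinOrder:
    """Return the candidate join order whose estimated cost is lowest."""
    return min(candidates, key=estimate_cost)


# Estimated costs quoted on the slides for the two candidate join orders.
slide_costs = {
    ("Customers", "Orders", "LineItems"): 1_350_225_000,
    ("Orders", "LineItems", "Customers"): 1_359_000_000,
}

# Iterating over the dict yields the candidate orders (its keys).
best = pick_best_order(slide_costs, slide_costs.__getitem__)
print(best)
```

Any cost estimator can be passed as `estimate_cost`; the lookup table stands in for one here.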

Page 15

3. How?

There are 2 approaches to building the estimation function:

Classical query optimization, denoted CA

MapReduce-Adapted query optimization, denoted MRA

Page 16

Classical query optimization

The cost estimation function is defined as:

|T| denotes the cardinality of the join.

T1 and T2 are the left side and the right side of the join: T1 ⋈ T2

Page 17

Classical query optimization

Returning to the earlier example with 2 join orders:

1. T = Customers ⋈ Orders ⋈ LineItems

C_CA(T) = 1,350,225,000

2. T = Orders ⋈ LineItems ⋈ Customers

C_CA(T) = 1,359,000,000

The first join order is preferred/selected.
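Both values are consistent with a recurrence in which a base table costs 0 and a join costs its own cardinality |T| plus the costs of its two sides, with the cardinality of a join estimated as the product of its input sizes. The Python sketch below works under that assumption; the class and function names are illustrative and do not come from the paper or from Hive.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Table:
    name: str
    rows: int


@dataclass(frozen=True)
class Join:
    left: "Node"
    right: "Node"


Node = Union[Table, Join]


def cardinality(node: Node) -> int:
    """Estimated tuple count; a join is estimated as the product of its inputs."""
    if isinstance(node, Table):
        return node.rows
    return cardinality(node.left) * cardinality(node.right)


def classical_cost(node: Node) -> int:
    """Assumed recurrence: 0 for a base table, |T| + cost(T1) + cost(T2) for a join."""
    if isinstance(node, Table):
        return 0
    return cardinality(node) + classical_cost(node.left) + classical_cost(node.right)


customers = Table("Customers", 150)
orders = Table("Orders", 1500)
line_items = Table("LineItems", 6000)

order1 = Join(Join(customers, orders), line_items)
order2 = Join(Join(orders, line_items), customers)

print(classical_cost(order1))  # 1350225000
print(classical_cost(order2))  # 1359000000
```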

Page 18

MapReduce-Adapted query optimization

Because Hadoop always executes a Map and a Reduce phase for each join operation, the number of Map/Reduce iterations grows as the number of tables increases.

The time taken by I/O operations (Map = read; Reduce = write) is large because of the volume of data.

The cost function therefore needs to be recalculated:

Page 19

MapReduce-Adapted query optimization

Returning to the earlier example with 2 join orders:

1. T = Customers ⋈ Orders ⋈ LineItems

C_MRA(T) = 150 * 1,500 = 225,000

2. T = Orders ⋈ LineItems ⋈ Customers

C_MRA(T) = 1,500 * 6,000 = 9,000,000

The first join order is preferred/selected again.
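The two numbers above match a cost that counts only the intermediate join results, that is, the tuples written by one reduce phase and read by the next map phase. The Python sketch below assumes exactly that, for a left-deep plan with product-based size estimates. This is an assumption drawn from the worked numbers, not the paper's formula, and `mra_cost` is a hypothetical name.

```python
from typing import Sequence


def mra_cost(sizes: Sequence[int]) -> int:
    """Sum the estimated sizes of intermediate join results in a left-deep plan.

    Assumption (not stated in the transcript): the final join result is not
    counted, because it is not fed into a following map phase.
    """
    cost = 0
    running = sizes[0]
    # Every join except the last one produces an intermediate result.
    for size in sizes[1:-1]:
        running *= size
        cost += running
    return cost


print(mra_cost([150, 1500, 6000]))   # Customers, Orders, LineItems -> 225000
print(mra_cost([1500, 6000, 150]))   # Orders, LineItems, Customers -> 9000000
```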

Page 20

MapReduce-Adapted vs Classical query optimization

Classical approach: uses the intermediate cost of previous joins as an additional cost for the join of two subtrees.

MapReduce-Adapted approach: focuses on the number of tuples generated during the join and on the I/O between a reduce phase and the following map phase.

Page 21

MapReduce-Adapted vs Classical query optimization

The authors showed that the MapReduce-Adapted approach is better than the Classical approach.

Another example (4 tables and 3 joins), with the following algebra structure (not optimized):

Page 22

MapReduce-Adapted vs Classical query optimization

After optimization by the Classical approach, the new join order becomes:

Page 23

MapReduce-Adapted vs Classical query optimization

But if we apply the MapReduce-Adapted approach, we get another order:

Page 24

MapReduce-Adapted vs Classical query optimization

Execution result:

Page 25

Query optimization using Hive column statistics

The approaches mentioned before all use the cardinality of the table (table size) in their cost function.

In some cases, column statistics let us use another value instead of the table size, which improves performance.

Page 26

What are column statistics?

Column statistics: useful statistical information generated on the columns of a table, for example: max/min value of the column; highest-frequency value; number of distinct values, ….
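As an illustration of the statistics listed above, the following Python sketch computes them for a single column held in memory. It is not how Hive gathers statistics; `column_statistics` and the dictionary keys are hypothetical names chosen for this example.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def column_statistics(values: Sequence[Any]) -> Dict[str, Any]:
    """Compute simple statistics for one non-empty column of comparable values."""
    counts = Counter(values)
    # most_common(1) returns a one-element list of (value, frequency).
    top_value, top_frequency = counts.most_common(1)[0]
    return {
        "min": min(values),
        "max": max(values),
        "most_frequent_value": top_value,
        "highest_frequency": top_frequency,
        "distinct_values": len(counts),
    }


stats = column_statistics([3, 7, 7, 1, 7, 3])
print(stats)
```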

Page 27

Why can column statistics help improve performance?

Let’s consider this case:

SELECT ……..

FROM Customers ⋈ Orders ⋈ LineItems

WHERE LineItems.Col1 = V

Assume the column statistics report that the highest frequency in Col1 is very small (100, for example). Therefore, the maximum number of rows returned from LineItems is 100.

Clearly, this number <<< |LineItems| (6,000).

Page 28

Why can column statistics help improve performance?

The cost function indicates that join order #2 is now better than join order #1. Indeed:

Join order 1: (Customers (150) ⋈ Orders (1,500)) ⋈ LineItems (100 instead of 6,000)

Cost1 = 150*1500 + 150*1500*100 = 22,725,000

Join order 2: (Orders (1,500) ⋈ LineItems (100 instead of 6,000)) ⋈ Customers (150)

Cost2 = 100*1500 + 1500*100*150 = 22,650,000
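The arithmetic above can be reproduced with a short Python sketch. It assumes, as the slides do, that each join result is estimated by the product of its input sizes and that the cost is the sum of those estimates. `effective_rows` and `classical_cost_left_deep` are illustrative names, not part of Hive or the paper.

```python
from typing import Sequence


def effective_rows(table_rows: int, highest_frequency: int) -> int:
    """Upper bound on rows matching `col = V`.

    No value can occur more often than the most frequent one, so the
    highest frequency caps the number of matching rows.
    """
    return min(table_rows, highest_frequency)


def classical_cost_left_deep(sizes: Sequence[int]) -> int:
    """Sum the estimated result sizes of every join in a left-deep plan."""
    cost = 0
    running = sizes[0]
    for size in sizes[1:]:
        running *= size
        cost += running
    return cost


# LineItems shrinks from 6,000 to at most 100 rows after the filter.
line_items = effective_rows(6000, 100)

cost1 = classical_cost_left_deep([150, 1500, line_items])  # Customers, Orders, LineItems
cost2 = classical_cost_left_deep([1500, line_items, 150])  # Orders, LineItems, Customers
print(cost1, cost2)  # 22725000 22650000
```

With the unfiltered size of 6,000 the same function gives 1,350,225,000 and 1,359,000,000, so the preferred order flips once the statistic is used.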

Page 29

Why can column statistics help improve performance?

The cost function indicates that join order #2 is now better than join order #1. Indeed:

Join order 1: (Customers (150) ⋈ Orders (1,500)) ⋈ LineItems (100 instead of 6,000)

Cost1 = 150*1500 + 150*1500*100 = 22,725,000

Join order 2: (Orders (1,500) ⋈ LineItems (100 instead of 6,000)) ⋈ Customers (150)

Cost2 = 100*1500 + 1500*100*150 = 22,650,000

Cost2 < Cost1 now means:

Order 2 is better than Order 1.

Join order 2 is now chosen instead of join order 1.

Page 30

Thank you!