[ieee 2011 annual ieee india conference (indicon) - hyderabad, india (2011.12.16-2011.12.18)] 2011...

Genetic Optimization for the Join Ordering Problem of Database Queries

Swati V. Chande Department of Computer Science

International School of Informatics and Management Jaipur, India

[email protected]

Madhavi Sinha Department of Computer Science

Birla Institute of Technology, Mesra, Jaipur Campus Jaipur, India

[email protected]

Abstract- A query optimizer is a core component of any Database Management System. Because a database query optimizer may face voluminous and complex queries, leading to a huge search space of alternative query plans, its search strategy should be one that is suited to large and complex problems. Genetic Algorithms (GAs) avoid the high cost of exhaustive optimization and provide flexibility by being independent of problem-specific knowledge. These qualities make them a viable option for solving the query optimization problem.

All query optimization algorithms primarily deal with joins. Our study concerns the use of a GA for join order optimization. Prior studies indicate that genetic algorithms are well suited to optimizing join expressions and produce solutions of high quality within a reasonable running time. We have implemented the GA technique on RDBMS queries and found that the GA-based optimizer performs better for queries involving a large number of joins.

Keywords- Relational Database Management System; Genetic Algorithm; Crossover; Query Optimization; Query Processing; Join Order Optimization

I. INTRODUCTION

Genetic Algorithms (GAs) have attracted scientists from several research areas, specifically those involving optimization, including the database query optimization domain. Because a database optimizer may face varied and voluminous queries, its search strategy should adapt easily to problems involving large and complex data. Genetic strategies do not guarantee that the best solution is obtained, but they avoid the high cost of optimization, and the flexibility they provide by being independent of problem-specific knowledge makes them a viable option for solving the query optimization problem. Studies on the applicability of GAs to the database query optimization problem have provided encouraging results.

A central issue in relational query optimization is the selection of an effective join ordering, i.e., an order for evaluating efficiently the join predicates of a given query. For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders-of-magnitude more time to execute than one that joins A and C first.
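The effect of join order on intermediate result sizes can be illustrated with a minimal sketch. The table sizes are those of the example above; the join predicates (A-C and B-C), their selectivity of 1/10,000, and the cost measure (the sum of intermediate result cardinalities of a left-deep plan) are assumptions made purely for illustration and are not taken from the experiments in this paper.

```python
from fractions import Fraction

# Table sizes from the example in the text.
SIZES = {"A": 10, "B": 10_000, "C": 1_000_000}

# Hypothetical join predicates and selectivities (assumed for illustration).
SELECTIVITY = {
    frozenset("AC"): Fraction(1, 10_000),
    frozenset("BC"): Fraction(1, 10_000),
}


def plan_cost(order):
    """Sum of intermediate result sizes for a left-deep join order.

    Joining a new relation multiplies the current row estimate by the
    relation's size and by the selectivity of every predicate that links
    it to an already joined relation (1 if there is no predicate).
    """
    joined = [order[0]]
    rows = Fraction(SIZES[order[0]])
    total = Fraction(0)
    for rel in order[1:]:
        rows *= SIZES[rel]
        for other in joined:
            rows *= SELECTIVITY.get(frozenset((other, rel)), Fraction(1))
        joined.append(rel)
        total += rows
    return total


print("B, C first:", plan_cost(["B", "C", "A"]))
print("A, C first:", plan_cost(["A", "C", "B"]))
```

Under these assumed selectivities, the plan that joins B and C first produces intermediate results about 500 times larger than the plan that joins A and C first. The actual gap depends on the real selectivities and on the physical join methods chosen.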

This important problem has been the focus of active research since the first years of relational database development, and earlier studies have introduced a host of effective optimization techniques [1, 2, 3, 4, 5, 6]. Most query optimization algorithms therefore deal primarily with joins, and studies on the use of Genetic Algorithms in query optimization [7, 8, 9, 10, 11] likewise focus on joins. Selection of an appropriate index for query execution is another major concern, and research has also been done on the use of Genetic Algorithms for index selection [12, 13, 14, 15]. Our study concerns the use of GA for join order optimization.

The paper focuses on enhancing and furthering previous work on the creation of a genetic query optimizer. We have worked on a genetic algorithm for the optimization of RDBMS queries with a large number of joins. Our genetic-algorithm-based query optimizer is tested and compared with non-genetic query optimizers and also with geqo of PostgreSQL, the only implemented genetic query optimizer.

The paper is organized as follows. Section 2 defines the problem and the experimental setup, and Section 3 describes the queries and tools used for the experiments. Section 4 provides a comparative analysis of the performance of our GA-based optimizer and the non-GA optimizers, and also compares the performance of the PostgreSQL genetic optimizer with our solution. Section 5 concludes the paper.

II. PROBLEM DEFINITION

The problem of determining good evaluation strategies for join expressions has been addressed since the first relational database systems. The objective is to obtain an optimal plan from among the alternative join orders. In this study, the joins are ordered using the Genetic Algorithm approach.

The experiments are then conducted with two major objectives:

i) Study the viability of the proposed approach with reference to the changed computing configurations and data retrieval requirements: The pioneering studies on GA for query optimization were carried out at least a decade ago and addressed queries with not more than 30 relations/joins. GA-based query optimization can be expected to be viable in the present scenario with respect to computing power and DBMS structures, since computing facilities and data retrieval requirements have changed drastically since then.

ii) Compare the results of our Genetic Query Optimizer (GQO) with existing implementations: The results are compared with the GA implementation in PostgreSQL query optimization since PostgreSQL is the only known DBMS whose query optimizer is partially based on Genetic Algorithm.

The framework of the experiment is shown in Fig. 1.

We have considered Select Project Join (SPJ) queries, i.e., queries without aggregates and subqueries. The queries have equijoins and conditions, including composite conditions. The reason for this choice is twofold. First, this analysis involves a large set of executions that are, in general, very time-consuming. Second, SPJ queries are the most frequently used queries.

We could not use the TPC-H queries, the benchmark for performance in database systems, as they do not have the minimum number of tables necessary to test GQO. A different set of queries therefore had to be designed to test the GQO and compare it with other optimizers.

The queries designed have join clauses between all tables, and the algorithm determines the better join order. Queries of six sizes were created for testing the GQO, with 15, 20, 30, 40, 50 and 60 tables participating in them. We have numbered these queries Q15, Q20, Q30, Q40, Q50 and Q60 to indicate the number of relations used in them. The total number of rows processed by these queries is more than 500. Each query was executed 30 times sequentially using i) the DB2 and MySQL non-GA optimizers, ii) the PostgreSQL GA and non-GA optimizers, and iii) GQO, our GA-based query optimizer.

All the queries were executed in DB2, MySQL and PostgreSQL, and the results in terms of execution time were then compared with those of the GQO. Cost comparison could not be performed between DB2 or MySQL and GQO, since the units of cost are not necessarily the same and are not documented. However, since we wanted GQO to be compared on both the time and the cost requirements of a GA and a non-GA optimizer, we used PostgreSQL for the implementation of our GQO.

Figure 1. Experiment framework

III. IMPLEMENTATION OF THE GQO IN A RDBMS

The classical solution to the query optimization problem, exhaustive search by way of dynamic programming, offers several advantages over competing techniques, not the least of which is that it guarantees optimality according to a given cost model. On the downside, exhaustive search has exponential complexity, and is reputedly intractable for joins of more than about 8 relations.

However, this reputation is undeserved. Increases in processor speeds, combined with low-overhead implementation techniques, now make it feasible under a wide range of conditions to explore the full space of bushy join plans for joins of at least 15 relations [16]. The GQO was therefore developed to demonstrate the applicability of genetic algorithms to query optimization for queries with over 15 joins.
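To give a sense of how quickly the plan space grows, the following sketch counts the left-deep join orders (n!) and the bushy join trees ((2(n-1))!/(n-1)!, counting operand order) for n relations. These are standard combinatorial counts of the plan space, not measurements from this study, and they ignore the subplan sharing that dynamic programming exploits.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders for n relations (all permutations)."""
    return factorial(n)


def bushy_trees(n: int) -> int:
    """Number of bushy join trees over n relations, counting operand order.

    This is the number of binary tree shapes (a Catalan number) times the
    n! ways to place the relations on the leaves: (2(n-1))! / (n-1)!.
    """
    return factorial(2 * (n - 1)) // factorial(n - 1)


# Query sizes mentioned in the text: about 8, 15, and the largest tested (60).
for n in (8, 15, 60):
    print(f"n={n}: left-deep={left_deep_orders(n):.3e}, bushy={bushy_trees(n):.3e}")
```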

In our implementation of the genetic query optimizer (the GQO), we use a steady-state GA whose chromosomes represent joins between relations using the path representation approach. The initial population was generated randomly, and duplicate chromosomes were dropped from the population. For the genetic operations we used the Modified Enhanced Edge Recombination Operator given in [17], swap mutation with a mutation rate of 0.005, and linear selection techniques.
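The population set-up and mutation step described above can be sketched as follows. This is a minimal illustration and not the GQO code: the function names, the population size of 50, the relation names R1 to R15, and the reading of the mutation rate as a per-chromosome probability are all assumptions.

```python
import random


def initial_population(relations, size, rng):
    """Random join orders in path representation, with duplicates dropped."""
    seen = set()
    population = []
    # Cap the attempts so that a very small search space cannot loop forever.
    for _ in range(size * 20):
        if len(population) == size:
            break
        chromosome = tuple(rng.sample(relations, len(relations)))
        if chromosome not in seen:
            seen.add(chromosome)
            population.append(chromosome)
    return population


def swap_mutation(chromosome, rate, rng):
    """With probability `rate`, exchange two randomly chosen positions.

    Treating `rate` as a per-chromosome probability is an assumption.
    """
    genes = list(chromosome)
    if rng.random() < rate:
        i, j = rng.sample(range(len(genes)), 2)
        genes[i], genes[j] = genes[j], genes[i]
    return tuple(genes)


rng = random.Random(42)
relations = [f"R{i}" for i in range(1, 16)]  # hypothetical 15-relation query
population = initial_population(relations, size=50, rng=rng)
mutant = swap_mutation(population[0], rate=0.005, rng=rng)
```

Because every chromosome is a permutation of the relations, a swap always yields another valid join order, which is one reason swap mutation pairs naturally with the path representation.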

To find out whether the results generated by GQO are practically sensible, i.e. whether they are acceptable in terms of time and cost as compared to non-GA and available GA-based optimizers, and whether GQO compares well with the existing implementation, PostgreSQL was chosen. We also chose PostgreSQL because it has several features that make the evaluation of optimization for complex queries possible. Another reason for this choice is that it has an optional implementation of a genetic-algorithm-based query optimizer, the only known implementation of a GA for query optimization so far. Comparison was therefore deemed easier. A detailed evaluation of PostgreSQL's algorithm and the developed GQO algorithm was possible because of the availability of PostgreSQL's code and its systematic coding and documentation.

The PostgreSQL query optimizer comprises two distinct methods for generating the execution plan of a query, the genetic algorithm method being optional. The PostgreSQL Genetic Query Optimization module was implemented by [9] and allows the RDBMS to support large join queries through a non-exhaustive search. This optimizer is used automatically for queries with 12 or more relations.

Finally, the GQO was tried on queries outside the sample test queries to check whether it performs well in practice.

The GQO was developed using C/C++ in the Microsoft Visual Studio 2005 environment, to keep compatibility with RDBMS modules. The function call to the GQO algorithm was included in the PostgreSQL optimizer function that starts geqo, its genetic optimizer, and the regular optimizers. RelOptInfo is the PostgreSQL data structure that represents both base relations (single tables) and join relations. This structure was used in GQO. A RelOptInfo structure is created for all possible joins between pairs of tables, and the possible physical join methods are added to this structure as paths. If the pair does not contain a join clause (WHERE) between the relations, one path representing the Cartesian product is generated. The estimated cost for each path is calculated by PostgreSQL's module, using the table statistics available to the optimizer.

[Figure 1 box labels: Organizing Inputs & Environmental Settings (Database queries, parameters, and tools); GA based query optimizer (GQO); Execution of queries using DB2, MySQL and PostgreSQL non-GA; Execution of queries using PostgreSQL GA optimizer (geqo); Check Viability of the GQO; Check Comparability of the GQO]

PostgreSQL's code is organized in modules, and allpaths.c was modified to add the function call to the GQO algorithm in the PostgreSQL query optimizer, in the make_rel_from_joinlist() function. If the GQO algorithm is to be used, the parameter gqo is set to on in the configuration file postgresql.conf.

This paper focuses mainly on the analysis of queries involving 15, 20, 30, 40, 50 and 60 relations.

The PostgreSQL tools used are:

i) psql, a terminal-based frontend, and

ii) the EXPLAIN command with the ANALYZE option, which displays the total elapsed time expended within each plan node (in milliseconds) and the total number of rows it actually returned.

The DB2 tool db2batch was used to time the SQL queries.

For MySQL, the time compared was the total planning and execution time; correspondingly, the GQO time was recorded in the same way for the comparison with MySQL.

The study was performed in the Windows XP environment. No programs except Symantec Antivirus were running on the system during the executions and recordings. The versions of the RDBMS packages used were IBM DB2 Version 7, MySQL Version 5.1 and PostgreSQL Version 8.4.

IV. COMPARATIVE ANALYSIS

The results of the studies cannot compare the quality of the optimizers, as the optimizers are conceptually diverse. They determine whether the optimizer could execute the queries, and up to what level.

A. Execution with DB2

Queries with 15, 20, 30, 40, 50 and 60 joins were chosen randomly and were executed 30 times each, for each set of joins, in DB2 and for GQO to compare the execution time in the two.

The results indicate that the DB2 optimizer performs significantly better than the PostgreSQL (and hence GQO) optimizer for 15-40 joins. However, beyond 40 joins, the DB2 optimizer fails to process queries, while the PostgreSQL optimizer (GQO), though performing relatively weakly for up to 40 joins, continues to process queries with a higher number of joins on the same system, with the same configuration and in the same environment.

For the 50-join and 60-join queries, DB2 displayed the following message:

** CLI Error in Preparing the statement: (-101): [IBM][CLI Driver][DB2/NT] SQL0101N The statement is too long or too complex. SQLSTATE=54001

As the error message explains, it is displayed if the statement could not be processed because it exceeds a system limit for either length or complexity [18]. Thus queries with more than 40 joins could not be processed in DB2.

B. Execution with MySQL

As with DB2, queries with 15, 20, 30, 40, 50 and 60 joins were chosen randomly and were attempted 30 times each, for each set of joins in MySQL and for GQO.

MySQL reported the following when the 40 join queries were attempted.

ERROR 2008 <HY000>:MySQL client ran out of memory.

The results indicate that the MySQL optimizer performs better than the PostgreSQL (and hence GQO) optimizer for 15-30 joins in terms of execution time. However, beyond 30 joins, the MySQL optimizer fails to process queries, while the GQO, though performing relatively slowly for up to 30 joins, continues to process queries with a higher number of joins on the same system, with the same configuration and in the same environment.

The MySQL non GA optimizer does not process over 30 join queries under the given conditions while the GQO does.

C. Execution with PostgreSQL

As with DB2 and MySQL, queries with 15, 20, 30, 40, 50 and 60 joins were chosen randomly and were attempted 30 times each, for each set of joins, with the PostgreSQL (non-GA) optimizer and with GQO.

The PostgreSQL optimizer could not perform non-GA optimization for queries with 20 or more joins, and the process timed out. The GQO performed better than the PostgreSQL non-GA optimizer in terms of execution time for 15-join queries, though the difference was not very significant.

D. Comparison of Genetic Algorithm Optimizer with GQO

The executions were tried for Q15, Q20, Q30, Q40, Q50 and Q60 and, as above, 30 runs were performed for each set of joins for time and for cost comparisons between the PostgreSQL genetic optimizer (geqo) and GQO.

1) Assessment on Execution Time

The time taken to execute the queries was recorded in milliseconds for all the queries.

The average execution time for GQO and PostgreSQL (geqo) for each of the set of queries is shown in Fig. 2.

It can be deduced from the results that percentage decrease in time from PostgreSQL (geqo) to GQO ranges from 3.49% to 11.33% for the sets of queries used. The overall average for PostgreSQL (geqo) is 10.50 ms and for GQO 9.93 ms. Based on these observations and calculations, the average percentage decrease in execution time from PostgreSQL (geqo) to GQO is 5.42%.
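The percentage decrease quoted above follows the usual relative-difference formula, which can be checked with a short sketch. The two averages are the rounded values given in the text; the function and variable names are illustrative only.

```python
def percent_decrease(baseline_ms: float, new_ms: float) -> float:
    """Relative decrease from baseline_ms to new_ms, as a percentage."""
    return (baseline_ms - new_ms) / baseline_ms * 100.0


geqo_avg_ms = 10.50  # overall average for PostgreSQL (geqo), from the text
gqo_avg_ms = 9.93    # overall average for GQO, from the text

print(f"{percent_decrease(geqo_avg_ms, gqo_avg_ms):.2f}%")
```

With these rounded averages the formula gives roughly 5.4%, in line with the reported 5.42%. Any small difference in the last digit would presumably come from rounding of the published averages or from averaging the per-query-set percentages instead.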

Figure 2. Average execution time for PostgreSQL (geqo) and GQO [Chart; x-axis: No. of Runs of each set of Joins (1 to 30); y-axis: Average Time (in ms), 0 to 14; series: PostgreSQL (geqo), GQO]

The GQO can thus be said to provide an improvement of 5.42% over geqo in execution time.

2) Assessment on Execution Cost

The readings for cost comparison reveal that the GQO is costlier than the PostgreSQL (geqo). However, when execution time and execution cost are considered together, the GQO's performance is better overall. In terms of cost, the GQO performs better for 15-join and 20-join queries, but for queries with more joins, the PostgreSQL geqo performs better.

As per the results, the difference between the cost for PostgreSQL (geqo) and GQO varies from -3.3% to 6.3%. Thus, in case of cost of execution, the PostgreSQL geqo’s performance is better than the GQO’s performance by 2.99%.

E. Analysis of Results

The DB2, MySQL, and PostgreSQL non-GA optimizers could not process queries with more than 40, more than 30, and 20 or more relations, respectively, under the given conditions, while the GQO could process more complex queries. The GQO, due to the inherent property of a GA to perform fitness-based selection and to evolve fitter solutions, is able to process more complex queries in less time, as has also been shown in previous GA-based studies on query optimization.

For the GQO, the cost function and the data structures of PostgreSQL have been used. If the cost models of DB2 or MySQL were used instead, the genetic query optimizer could benefit from them, and the differences in performance could be more pronounced if GA and non-GA optimizers were compared within the same database system.

The GQO performed better than PostgreSQL’s geqo too. In GQO, the duplicate chromosomes have been restricted to increase diversity in the initial population, the query plans are chosen to be crossed and replaced on the basis of their cost so that the optimizer focuses on the best properties in the population, and converges faster, thereby improving the quality. The combination of parameters and operators used in the GQO thus provides satisfactory results.
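The cost-based choice of plans for crossover and replacement can be illustrated with a minimal steady-state step. This sketch is not the GQO implementation: it assumes linear ranking for selection, substitutes a simplified order-based crossover for the Modified Enhanced Edge Recombination Operator of [17], and uses a hypothetical toy cost function in place of the PostgreSQL cost model.

```python
import random


def linear_rank_select(ranked, rng):
    """Pick one plan from `ranked` (cheapest first) with linear rank weights."""
    n = len(ranked)
    weights = list(range(n, 0, -1))  # best gets weight n, worst gets 1
    return rng.choices(ranked, weights=weights, k=1)[0]


def order_crossover(p1, p2, rng):
    """Stand-in crossover: keep a slice of p1, fill the rest in p2's order."""
    i, j = sorted(rng.sample(range(len(p1)), 2))
    middle = p1[i:j + 1]
    rest = [g for g in p2 if g not in middle]
    return tuple(rest[:i]) + tuple(middle) + tuple(rest[i:])


def steady_state_step(population, cost, rng):
    """Breed one child; it replaces the worst plan if cheaper and not a duplicate."""
    ranked = sorted(population, key=cost)
    child = order_crossover(linear_rank_select(ranked, rng),
                            linear_rank_select(ranked, rng), rng)
    if child not in ranked and cost(child) < cost(ranked[-1]):
        ranked[-1] = child
    return ranked


SIZES = {f"R{i}": i * 100 for i in range(1, 9)}  # hypothetical row counts


def toy_cost(order):
    """Hypothetical cost: relations placed early are weighted more heavily."""
    n = len(order)
    return sum((n - pos) * SIZES[r] for pos, r in enumerate(order))


rng = random.Random(7)
population = [tuple(rng.sample(sorted(SIZES), len(SIZES))) for _ in range(20)]
start_best = min(map(toy_cost, population))
for _ in range(200):
    population = steady_state_step(population, toy_cost, rng)
end_best = min(map(toy_cost, population))
print(start_best, end_best)
```

Since only the worst plan is ever replaced, and only by a strictly cheaper one, the best cost in the population cannot get worse from one step to the next.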

We executed 30 queries with 15 to 60 joins which were not included in the above experiments (i.e. not in the Q15 to Q60 set) and found that the results lay within the range of the GQO results. This was done to ascertain that the GQO performs well in practice, i.e. with an out-of-sample test.

V. CONCLUSION

The experiments performed have shown that the genetic approach to query optimization is a viable technique even for queries of increased complexity and with advanced computing configurations. In general, we have shown that our technique performs better than the standard RDBMS optimizers of DB2 and MySQL when the number of relations in a query is large, and that it improves on the existing genetic implementation for query optimization in PostgreSQL.

To our knowledge, we have presented the first comparison between the optimizers of two well-known RDBMS packages and the genetic approach. We have shown that, for queries with a small (up to 10) or moderate (up to 30) number of joins, the deterministic optimizers perform better. The GAs do well beyond this, i.e. for queries with over 30 joins. We have demonstrated that the combination of techniques presented in this paper outperforms previous proposals in terms of speed. In terms of cost too, there is an improvement for queries with up to 20 joins; beyond that the cost is slightly higher, but within acceptable limits.

REFERENCES

[1] S. Chaudhuri and K. Shim, “Optimization of queries with user-defined predicates”, VLDB, 1996, pp 87-98.

[2] G. Graefe and D. J. DeWitt, “The exodus optimizer generator”, ACM SIGMOD, 1987, pp 160-172.

[3] Y.E. Ioannidis and Y.C. Kang, “Randomized Algorithms for Optimizing Large Join Queries”, Proc. SIGMOD Conference, 1990, pp 312-321.

[4] R. Krishnamurthy, B. Boral, and C. Zaniolo, “Optimization of nonrecursive queries”, VLDB, 1986, pp 128-137.

[5] H. Pirahesh, J. M. Hellerstein, and W. Hasan, “Extensible/rule based query rewrite optimization in starburst”, ACM SIGMOD Conf., 1992, pp 39-48.

[6] P. G. Selinger, M. M. Astrahan, R. D. Chamberlin, R. A. Lorie, and T. G. Price, “Access path selection in a relational database management system”, ACM SIGMOD, 1979, pp 23-34.

[7] Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper, “Heuristic and randomized optimization for the join ordering problem”, The VLDB Journal, vol. 6, number 3, 1997, pp 191–208.

[8] K. Bennett, M. Ferris and Y. Ioannidis, “A genetic algorithm for database query optimization”, Proceedings of the 4th International Conference on Genetic Algorithms, 1991, pp 400-407.

[9] M Utesch, “Genetic Query Optimizer”, PostgreSQL Documentation, 1998.

[10] S Vellev, “An adaptive genetic algorithm with dynamic population size for optimizing join queries”, Advanced Research in Artificial Intelligence, Supplement to International Journal ‘Information Technologies and Knowledge’, vol. 2, 2008, pp 82-88.

[11] L Fang, P Wang and J Yan, “A Multi-copy Join Optimization of Information Integration Systems Based on A Genetic Algorithm”, Proceedings of the Third International Multi-Conference on Computing in the Global Information Technology, 2008, pp 223-228.

[12] F. Fotouhi and C E Galarce, “Genetic Algorithms and the Search for Optimal Database Index Selection”, Proceedings of the Great Lakes Computer Science Conference, 1991, pp 249-255.

[13] J Celko, “Genetic Algorithms and Database Indexing”, Dr. Dobb's Journal, April 1993.

[14] J Kratica, I Ljubic and D Tošic, “A genetic algorithm for the index selection problem”, Applications of Evolutionary Computing: EvoWorkshops2003, vol. 2611 of LNCS, University of Essex, England, UK, 14-16, Springer-Verlag, 2003, pp 280-290.

[15] V Kovačević and B Filipič, “A genetic algorithm based tool for the database index selection problem”, Proceedings of the 8th International Multiconference Information Society IS, 2005, pp 378-381.

[16] Bennet Vance and David Maier, “Join-order optimization with cartesian products”, Doctoral Dissertation, Oregon Graduate Institute of Science and Technology 1998.

[17] Y C Tang and K S Leung, “A modified edge recombination operator for the travelling salesman problem”, PPSN III: Proceedings of the International Conference on Evolutionary Computation, The Third Conference on Parallel Problem Solving from Nature, 1994, pp 180-188.

[18] IBM DB2 Universal Database Message Reference, vol. 2, Version 7.
