research article distributed query plan generation using ...downloads.hindawi.com › journals ›...

18
Research Article Distributed Query Plan Generation Using Multiobjective Genetic Algorithm Shina Panicker 1 and T. V. Vijay Kumar 2 1 SFIO-NIC Division, National Informatics Center, Ministry of Information Technology, New Delhi 110003, India 2 School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi 110067, India Correspondence should be addressed to T. V. Vijay Kumar; [email protected] Received 23 August 2013; Accepted 23 January 2014; Published 14 May 2014 Academic Editors: L. D. S. Coelho, Y. Jiang, W. Su, and G. A. Trunfio Copyright © 2014 S. Panicker and T. V. Vijay Kumar. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. A distributed query processing strategy, which is a key performance determinant in accessing distributed databases, aims to minimize the total query processing cost. One way to achieve this is by generating efficient distributed query plans that involve fewer sites for processing a query. In the case of distributed relational databases, the number of possible query plans increases exponentially with respect to the number of relations accessed by the query and the number of sites where these relations reside. Consequently, computing optimal distributed query plans becomes a complex problem. is distributed query plan generation (DQPG) problem has already been addressed using single objective genetic algorithm, where the objective is to minimize the total query processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC). In this paper, this DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize total LPC and minimize total CC. ese objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-II. Experimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows that the former performs comparatively better and converges quickly towards optimal solutions for an observed crossover and mutation probability. 1. Introduction Advancement in technology has made it possible today to gather timely and effective information from vast sources of data (sites) distributed geographically across a network. e users at local sites can work independently as well as communicate with other sites to retrieve data for answering global queries. Such a setup is referred to as a Distributed Database System (DDS) [1, 2]. Query posed on a DDS is generally decomposed into subqueries, which are processed at the respective local sites where the data resides, before being transmitted to another site for cumulative processing of distributed data fragments. At the user end, an integrated result is displayed. A distributed query processing strategy aims to minimize the overall cost of query processing in such systems [3]. e cost of query processing in a DDS comprises of two costs: the local processing cost and the site- to-site communication or the transmission cost of relation fragments. e total cost incurred in processing a distributed query can thus be taken as the sum of the local processing cost at the individual participating sites and the cost of data communication among these sites. Local processing cost comprises of the cost of join operation on relations accessed by the user query and the communication cost is proportional to the size of relation fragments being transmitted among sites and its cost of transmission. ese costs need to be minimized in order to minimize the total query processing cost. In today’s scenario, with a multifold increase in the size of DDS, the communication cost asserts a major impact on the overall cost of query processing. e cost incurred in com- municating data through a congested network path or the communication of large data units between sites with higher communication costs can highly influence the cost of query processing and thus the sequence of sites through which the data fragments get processed has a significant impact on Hindawi Publishing Corporation e Scientific World Journal Volume 2014, Article ID 628471, 17 pages http://dx.doi.org/10.1155/2014/628471

Upload: others

Post on 27-Jun-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

Research ArticleDistributed Query Plan Generation UsingMultiobjective Genetic Algorithm

Shina Panicker1 and T V Vijay Kumar2

1 SFIO-NIC Division National Informatics Center Ministry of Information Technology New Delhi 110003 India2 School of Computer and Systems Sciences Jawaharlal Nehru University New Delhi 110067 India

Correspondence should be addressed to T V Vijay Kumar tvvijaykumarhotmailcom

Received 23 August 2013 Accepted 23 January 2014 Published 14 May 2014

Academic Editors L D S Coelho Y Jiang W Su and G A Trunfio

Copyright copy 2014 S Panicker and T V Vijay Kumar This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

A distributed query processing strategy which is a key performance determinant in accessing distributed databases aims tominimize the total query processing cost One way to achieve this is by generating efficient distributed query plans that involvefewer sites for processing a query In the case of distributed relational databases the number of possible query plans increasesexponentially with respect to the number of relations accessed by the query and the number of sites where these relations resideConsequently computing optimal distributed query plans becomes a complex problem This distributed query plan generation(DQPG) problem has already been addressed using single objective genetic algorithm where the objective is to minimize the totalquery processing cost comprising the local processing cost (LPC) and the site-to-site communication cost (CC) In this paperthis DQPG problem is formulated and solved as a biobjective optimization problem with the two objectives being minimize totalLPC and minimize total CC These objectives are simultaneously optimized using a multiobjective genetic algorithm NSGA-IIExperimental comparison of the proposed NSGA-II based DQPG algorithm with the single objective genetic algorithm shows thatthe former performs comparatively better and converges quickly towards optimal solutions for an observed crossover andmutationprobability

1 Introduction

Advancement in technology has made it possible today togather timely and effective information from vast sourcesof data (sites) distributed geographically across a networkThe users at local sites can work independently as well ascommunicate with other sites to retrieve data for answeringglobal queries Such a setup is referred to as a DistributedDatabase System (DDS) [1 2] Query posed on a DDS isgenerally decomposed into subqueries which are processedat the respective local sites where the data resides beforebeing transmitted to another site for cumulative processingof distributed data fragments At the user end an integratedresult is displayed A distributed query processing strategyaims to minimize the overall cost of query processing insuch systems [3] The cost of query processing in a DDScomprises of two costs the local processing cost and the site-to-site communication or the transmission cost of relation

fragmentsThe total cost incurred in processing a distributedquery can thus be taken as the sum of the local processingcost at the individual participating sites and the cost ofdata communication among these sites Local processing costcomprises of the cost of join operation on relations accessedby the user query and the communication cost is proportionalto the size of relation fragments being transmitted amongsites and its cost of transmission These costs need to beminimized in order to minimize the total query processingcost

In todayrsquos scenario with a multifold increase in the size ofDDS the communication cost asserts a major impact on theoverall cost of query processing The cost incurred in com-municating data through a congested network path or thecommunication of large data units between sites with highercommunication costs can highly influence the cost of queryprocessing and thus the sequence of sites through which thedata fragments get processed has a significant impact on

Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 628471 17 pageshttpdxdoiorg1011552014628471

2 The Scientific World Journal

the overall query processing cost It thus also plays a key rolein determining the overall performance of a DDS There canbe a number of possible ways to process and communicaterelation fragments involved in the query A distributed queryprocessing strategy evaluates all possible sequence of sitescorresponding to relations accessed in the query referredto as a query plan and determines the most optimal queryplan that minimizes the total cost that is local processing(CPU IO) cost and communication cost [5ndash11]The numberof possible query plans grows at least exponentially withthe increase in number of relations accessed by the query[12 13]This number increases further if relations accessed bythe query are replicated across multiple sites Performing anexhaustive search on all possible combinations of query plansis not feasible due to a vast search space Therefore in largeDDS devising a query processing strategy that optimizes thetotal query processing cost is shown to be a combinatorialoptimization problem [10]

Over the last three decades many algorithms and tech-niques have been devised to solve the class of combinatorialoptimization problems Initially the rigorous mathemati-cal and search based techniques like simulated annealingrandom search algorithms dynamic programming and soforth were used to solve such problems which thoughworked well with moderate sized problems on cost heuristiccould not succeed with complex multiobjective problemsThese mechanisms suffered from a drawback at certaininstances where they converged to local optima withoutexploring the entire search space [8 13 14] However inthe last two decades evolutionary techniques have gainedimmense popularity due to their applicability in solving thesecomplex scientific and engineering optimization problemsThese algorithms are inspired by the Darwinian evolutionthat accentuates the concept of ldquoSurvival of the Fittestrdquo [15]It is thus metaphorical to the natural social behavior andbiological evolution of species The evolutionary techniquesare now proved to be themost proficientmethod of choice forsolving such problems Genetic algorithm based techniqueswhich belong to the class of evolutionary algorithms havealso been widely used in solving complex real life science andengineering problemsThe strength of GA as a metaheuristiccomes from its ability to combine the good features fromseveral solutions to create new and better solutions [16 17]over generations

Most real world scientific and engineering problemshave often conflicting and competing objectives that needto be optimized The evolutionary strategies are proved tobe best suited for this class of problems as they can simul-taneously optimize the different objectives and find efficienttradeoffs unlike the classic techniques where the objectiveswere separately optimized and weighed based on the priorknowledge about the problem in hand The first pioneeringstudy on multiobjective evolutionary optimization came outin mideighties [18] In subsequent years several differentevolutionary algorithms (VEGA [19] MOGA [20] NPGA[21] NSGA [22] NSGA-II [4] SPEA [23] SPEA-II [19] PAES[24] PESA [25]) have successfully been implemented to solvethe classic optimization problems for example the singlesource shortest path problem [26] the all-pairs shortest path

problem [27] the multiobjective shortest path problem [28]the travelling salesman problem [29] the knapsack problem[30] and so forth Recently new evolutionary techniques forexample particle swarm optimization [31] artificial immunesystems [32] frog leaping algorithm [33] ant colony opti-mization [34] and so forth have been successfully appliedto the multiobjective optimization paradigm

This paper addresses the distributed query plan gen-eration (DQPG) problem given in [3] This problem isbased on a heuristic that favors query plans involving lessnumber of sites participating to retrieve the results Furtherquery plans involving smaller relations transmitted over lesscostly communication channels would incur less communi-cation costs and are thus favored over others Query plansgenerated based on this heuristic would result in efficientquery processing This DQPG problem was formulated andsolved as a single objective optimization problem in [3]Since this DQPG heuristic comprises minimization of boththe local processing cost and the communication cost anattempt has been made in this paper to minimize these costssimultaneouslyThat is theDQPGproblem is formulated as abiobjective optimization problem comprising two objectivesnamely minimization of the total local processing cost andminimization of the total communication cost In this paperthis problem has been solved using themultiobjective geneticalgorithm NSGA-II (nondominated sorting genetic algo-rithm) [4] The proposed NSGA-II based DQPG algorithmattempts to simultaneously minimize the two objectives withthe aim of achieving an acceptable tradeoff amongst them Itis shown that the optimization of total query processing costusing the proposed algorithm gives considerable improve-ment with respect to the time taken to converge and thequality of solutions with respect to total query processingcost when compared to the single objective GA based DQPGalgorithm given in [3]

This paper is organized as follows Section 2 discussesthe DQPG problem and its solution using the simple geneticalgorithm (SGA) given in [3] Section 3 discusses DQPGusing the multiobjective genetic algorithm An exampleillustrating the use of the proposed NSGA-II based DQPGalgorithm for generating optimal query plans for a distributedquery is given in Section 4The experimental results are givenin Section 5 Section 6 is the conclusion

2 DQPG Using SGA

This paper addresses the DQPG problem given in [3] solvedusing SGATheDQPGproblem is discussed next followed bya brief example describing the underlying methodology

21 The DQPG Problem Query plan generation is a keydeterminant for the efficient processing of a distributed queryThis necessitates devising a query plan generation strategythat would result in efficient query processing This strategywould require minimizing the total cost of query processingThe total cost incurred comprises the joint cost that is the costincurred in processing the query locally at the individual sitesand the cost of communicating the relation fragments among

The Scientific World Journal 3

the sites A distributed query processing strategy is given in[3] which aims to minimize the total query processing cost(TC) given below [3]

TC =

119898

sum

119894=1

LPC119894times 119886119894+

119894=119898minus1 119895=119898

sum

119894=1 119895=119894+1

CC119894119895times 119887119894 (1)

where LPC119894is the local processing cost per byte at site 119894 CC

119894119895

is the communication cost per byte between sites 119894 and 119895119886119894is the bytes to be processed at site 119894 119887

119894is the bytes to be

communicated from site 119894 and119898 is the total number of sitesFor each relation 119877

119905 Card(119877

119905) represents its cardinality and

Size(119877119905) represents the size of a single tuple in bytes At each

site the relations are integrated on common attributes usingthe equijoin operator to arrive at a single relation [3]

For relations 119877119905 with cardinality Card(119877

119905) and 119877

119904with

cardinality Card(119877119904) at site 119894 the cardinality Card

119894of the

resultant relation 119877119894is given as [3]

Card119894=

Card (119877119905) times Card (119877

119904)

Dist119905119904timesmin (Card (119877

119905) Card (119877

119904))

(2)

where Dist119905119904is the number of distinct tuples in the smaller

relation among 119877119905and 119877

119904

The size of the resultant relation119877119894at site 119894 is given as [1 3]

Size119894= Size (119877

119905) + Size (119877

119904) (3)

For a given query plan the communication between sitesoccurs in the order starting from the site having a relationwith lower cardinality to the site having a relation withhigher cardinality [3] The communication cost CC

119894119895and

local processing cost LPC119894are known a priori

The number of bytes to be processed locally at site 119896 isgiven by 119886

119896[3]

119886119896= Card

119894times Size

119894 (4)

The number of bytes to be communicated from site 119895 tosite 119896 is given by 119887

119895[3]

119887119895= Card

119895times Size

119895 (5)

Distributed query plans based on the above heuristic isgenerated using simple GA (SGA) in [3] This SGA basedDQPG as given in [3] is discussed next

22 SGA Based DQPG As discussed above it is a very com-plex task to generate efficient query plans from among a largeset of possible query plans An SGA based DQPG strategybased on the heuristic defined above is given in [3] whichaims to minimize the total cost of query processing (TC)indicating the fitness of a particular solution as compared toothers in the population The algorithm considers relationsaccessed by the query crossover and mutation probabilityand the prespecified number of generations (119866) as inputand produces the Top-119870 query plans as output First thealgorithm randomly generates an initial population of validquery plans (chromosomes) where the size of a query plan

is equal to the number of relations accessed by the queryEach gene in a chromosome represents a relation and theordering of relations in a chromosome is in increasing orderof their cardinality The value of a gene is the site where thecorresponding relation resides As an example for a queryaccessing four relations (1198771 1198772 1198773 and 1198774) arranged in theincreasing order of cardinalities one of the encoding schemesfor the chromosome representation can be (1 1 4 3) implyingthat 1198771 and 1198772 are in site 1 1198773 is in site 4 and 1198774 is insite 3 The fitness (TC) value is computed for each of thequery plans and thereafter the query plans are selected forcrossover using the binary tournament selection technique[35] These selected query plans undergo random single-point crossover [15 36] with probability 119875

119888 and mutation

[15 36] with probability 119875119898 The resultant new population

replaces the old population and the above process is repeatedfor the prespecified number of generations 119866 Thereafter theTop-119870 query plans are produced as output In this paperthe above single objective DQPG problem is formulated andsolved as amultiobjectiveDQPGproblem aswill be discussednext

3 DQPG Using MultiobjectiveGenetic Algorithm

In this paper the single objective DQPG problem discussedabove is formulated as a biobjective DQPG problem Thisformulation is given next

31 Multiobjective DQPG Problem Formulation In the GAbased DQPG algorithm given in [3] there is a single objec-tive that is Minimize TC It can be observed that TC com-prises two costs namely the local processing cost incurredat participating sites that is total processing cost (TPC)and communication cost between the participating sites thatis total communication cost (TCC) Since minimizing TCwould require minimizing TPC and minimizing TCC thissingle objective (Minimize TC)DQPGproblem is formulatedas a biobjective DQPG problem comprising two objectives asMinimize TPC andMinimize TCC Consider

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(6)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894is

the local processing cost per byte at site 119894 119887119894is the bytes to be

communicated from site 119894 and 119886119894is the bytes to be processed

at site 119894 CC119894119895 LPC

119894 119887119894 and 119886

119894are as discussed in Section 21

If a site contains a single relation its LPC is consideredzero TCC and TPC need to be minimized simultaneously toachieve an acceptable tradeoff

4 The Scientific World Journal

The abovemultiobjectiveDQPGproblemhas been solvedusing the multiobjective genetic algorithm which is dis-cussed next

32 Multiobjective Genetic Algorithms Conceptualization ofmultiobjective problems using veridical models has a greatresemblance to many real world engineering and designproblems that involve more than one coextensive and oftencompeting objectives that is maximize profit maximizethroughput minimize cost minimize response time and soforth In such a scenario no single solution can be termedas optimal as in the case of single objective optimizationproblems but rather a set of alternative solutions can bevisualized as a tradeoff between the different objectives underconsideration This set of solutions is regarded superior toothers in the search space as no other recordedavailablesolution can better optimize all the objectives consideredtogether [37ndash39]

Multiobjective optimization approaches can be broadlyclassified into three categories [37] The approaches in thefirst two categories can be termed as the classical optimiza-tion approaches which combine all objectives into a singlecomposite function using some combination of arithmeticoperators ormove all but one objective into the constraint setThe approaches in the first category have limitations in regardto appropriate selection of weights and designing functions inaccordance to the problem It wouldmandate the user to havea priori knowledge of the behavior of each objective functionto some extent for providing the range of values to objectivesso that none of them dominate the others which is not alwayspossible [17] This approach is generally denominated asaggregating functions and it has been implemented at severaloccasionswith relative success in situationswhere behavior ofthe objective function ismore or lesswell-known Someof theaggregating functions include the weighted sum approachgoal programming 120576-constraintmethod and so forth [40] Inthe second approach moving the objectives into a constraintset requires that the boundary values for each of the objectivesbe known a priori which is almost impossible In either of thetwo cases the optimization method returns a single solutionrather than a set of solutions giving possible tradeoffs andtherefore the quality of solution in these approaches greatlydepends upon the correct problem formulation If feasiblethese would be the most efficient and simplest approacheswhich would give atleast sub optimal results in most cases

The third approach overcomes the problems faced inthe classical optimization approaches and emphasizes thedevelopment of alternative techniques based on exploringthe complete set of nondominated solutions and therebyenabling the decision maker to choose among the differentalternatives This set of solutions is referred to as the Paretooptimal set [13] A Pareto optimal set can be formally definedas a set of solutions that are nondominated with respectto each other that is replacing one solution with anotherwithin the Pareto optimal set will invariably lead to a lossto one objective against a gain obtained in another objective[41] Pareto optimal sets can have varied sizes but usuallythe size increases with increase in the number of objectives

[37 40] They are more preferred over single solutionsas they closely resemble real world problems where thedecision maker makes a decision based on tradeoffs betweenmultiple objectives A number of techniques were formulatedto generate the Pareto optimal set for example simulatedannealing [14] Tabu search [42] ant colony optimization[34] and so forth The problem with these algorithms wasthatmost often they get struck at local optima and thus renderit infeasible to venture out for identifying new tradeoffsEvolutionary algorithms such as GA on the other hand seemto be especially suited for this task as they enable parallelexploration of different areas in the search space eventuallyexploiting the solutions attained using operators such ascrossover and mutation [13] It would enable determiningmore members of a Pareto optimal set in a single run insteadof a series of runs required in other blind search strategiesAlso the evolutionary algorithms require very little a prioriknowledge of the problem at hand and therefore are lesssusceptible to the typical shape and continuity of the ParetofrontThe Pareto front can be defined as the points that lie onthe boundary of the Pareto optimal region These algorithmsthus avoid convergence to a suboptimal solution [43]

Mathematically a multiobjective optimization problemwith 119898 decision variables and 119899 objectives can be definedwithout any loss of generality as amaximization orminimiza-tion problem given by [13 38]

minimize = 119891 (119909) = (1198911 (119909) 1198912 (119909) 119891119899 (119909))

where119909 = (1199091 1199092 119909119898) isin 119883

119910 = (1199101 1199102 119910119899) isin 119884

(7)

Here 119909 is the decision vector 119883 refers to the parameterspace 119910 is the objective vector and 119884 defines the objectivespace These objectives may be conflicting in nature that isimprovement in one may lead to deterioration in anotherSo it may become impossible to optimize all objectivessimultaneously in a single solution Instead the best tradeoffsolution would be of interest to a decision maker Thesesolutions form a Pareto optimal set which was initially coinedby Edgeworth and Pareto and is formally defined as [13 38]

ldquoA decision vector 119886 is said to be Pareto optimal if andonly if 119886 is nondominated regarding 119883 A decision vector 119886is said to be nondominated regarding a set 1198831015840 sube 119883 if andonly if there is no vector in 119883

1015840 which dominates 119886 Formallyit can be defined aslt1198861015840 isin 119883

1015840

119886

1015840

≺ 119886rdquoAlso a decision vector 119886 isin 119883 is said to dominate a

decision vector 119887 isin 119883 (also written as 119886 ≺ 119887) if and onlyif

forall119894 isin 1 2 119899 119891119894(119886) le 119891

119894(119887)

forall119895 isin 1 2 119899 119891119895(119886) le 119891

119895(119887)

(8)

Several multiobjective algorithms exist in the literature[4 18ndash25 37 40 41 44 45] of whichGA basedmultiobjectiveoptimization algorithms have been widely used for solvingmultiobjective optimization problems In this paper NSGA-II has been used to solve the DQPG problem NSGA-II willbe discussed next

The Scientific World Journal 5

f(x

2)

f(x1)

minus 1

+ 1

Figure 1 Crowding distance for solution ldquoVrdquo [4]

321 NSGA-II The basis of NSGA-II [4] lies in the non-dominated sorting genetic algorithm (NSGA) introduced bySrinivas and Deb [22] As the name suggests NSGA usesnondominated ranking for each individual in the populationand assigns them accordingly into nondominated frontsTheindividuals in the first front or the nondominated individualsare then assigned large dummy fitness values All individualsin the front shared this fitness value based on a sharing func-tion Next the individuals in the second nondominated frontare considered and similarly assigned a dummy fitness lowerthan the fitness assigned in the previous front This processcontinues till the entire population is classified into frontsSince the solutions in the first front have themaximumfitnessvalue their chances of selection increase and eventually morecopies of such solutions get passed on to the next generationHowever NSGA suffered from some drawbacks such as highcomputational complexity119874(119898119873

3

) nonelitist approach andthe requirement of specifying a shared parameter [4] Theselimitations were addressed in NSGA-II proposed by Deb etal [4] as an improved version of NSGA [22] It alleviatesthe drawbacks in NSGA by reducing the computationalcomplexity to 119874(119898119873

2

) Further it uses a parameter-lesssharing approach by using a crowding distance measurefor selection The crowding distance is an estimate of thedensity of solutions surrounding a particular solution in theobjective space In Figure 1 the crowding distance of solutionrepresented as point V is computed as the average distancebetween the two closest solutions represented as points V minus 1

and V + 1 on either side of the points V along each of theobjectives 119891(1199091) and 119891(1199092)

NSGA-II uses a crowded-comparison operator for selec-tion which takes into account both the nondomination rankof a query plan in the population and its crowding distanceThe nondominated solutions are preferred over dominatedsolutions and between two solutions having the same ranka solution that resides in the less crowded region is preferredthat is a solution for which the crowding distance is higherTheNSGA-II does not use any externalmemory but it ensureselitism by combining the best parents with the best offspringobtained [19] In this paper an NSGA-II based multiobjective

DQPG algorithm is used to compute optimal query plansfor a given distributed query This algorithm is discussednext

33 NSGA-II Based DQPG Algorithm The proposed NSGA-II based DQPG algorithm takes the relations given in theFROM clause of the distributed query as input It arrangesthese relations in increasing order of their cardinalities Itthen generates a fixed set of feasible query plans (chromo-somes) based on the possible combinations of sites in whichthese relations are residing Each gene in a chromosomerepresents a relation and is arranged in increasing orderof the corresponding relationrsquos cardinality The value of agene represents the site in which the corresponding relationresides For example suppose that a query posed by the userhas 4 relations (1198771 1198772 1198773 and 1198774) arranged in ascendingorderThe relation 1198771 is stored in sites 1198781 and 1198783 1198772 is storedin 11987811198773 is stored in 1198781 and 1198782 and1198774 is stored in 1198781Then theinitial population of feasible query plans (chromosomes) canbe (1 1 1 1) (3 1 1 1) (3 1 2 1) and (1 1 2 1) This definesthe encoding scheme for the given problem The proposedDQPG algorithm based on NSGA-II is given in Algorithm 1The steps involved in this algorithm are discussed as follows

Step 1 (Initialize the Population [4 46]) A random popula-tion of query plans is generated as per the encoding schemediscussed above

Step 2 (EvaluateQuery Plans on theObjective Functions)For each of the query plans in the population the TCC andTPC values are computed as given below

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(9)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894

is the local processing cost per byte at site 119894 119887119894is the bytes

to be communicated from site 119894 and 119886119894is the bytes to be

processed at site 119894 The procedure to compute CC119894119895 LPC

119894

119887119894 and 119886

119894is given in Section 21 If a site contains a single

relation its LPC is considered zero TCC and TPC needto be minimized simultaneously to achieve an acceptabletradeoff

Step 3 (Perform Nondominated Sort [4 46]) On the givenpopulation a fast nondominated sorting is performed in thefollowing manner

Twoobjective functions are consideredThefirst objectiveis to minimize the total processing cost (TPC) and thesecond objective is to minimize the total communicationcost (TCC) NSGA-II attempts to find a tradeoff betweenthese two objectives that can result in minimum total queryprocessing cost (TC)

6 The Scientific World Journal

Input 119877119894 Relations participating in the query 119875

119888 Probability of crossover

119875119898 Probability of mutation 119866 Pre-defined number of generations 119875size Population Size

Output Top-n query plansMethodInitialize a random parent population of query plans PPwhere chromosome length ldquolenrdquo is the number of relations accessed in the query and the gene at the 119896th position in thechromosome represents the site of the 119896th relationWHILE generation le 119866 DOStep 1 Evaluate each query plan in PP on the following objective functions

1198981 Minimize TCC =

119894=119906minus1119895=119906

sum

119894=1119895=119894+1

CC119894119895times 119887119894

1198982 Minimize TPC =

119906

sum

119894=1

LPC119894times 119886119894

where 119906 is the number of sites accessed by the query plan in ascending order of cardinality per site CC119894119895is the communication

cost per byte between sites 119894 and 119895 LPC119894is the local processing cost per byte at site 119894 119887

119894is the bytes to be communicated from

site 119894 and 119886119894is the bytes to be processed at site 119894

Step 2 Perform Non-Dominated (ND) Sort on PP for ldquo1198981rdquo and ldquo119898

2rdquo separately and place each query plan (QP)

into corresponding ND fronts ldquo119894rdquo and sort the QPs within each ldquo

119894rdquo

Step 3 Evaluate Crowding Distance Function 119868(119889) for each objective functionAssign 119868(119889) = infin for smallest and highest values in each front ldquo

119894rdquo

For the remaining QPs 119868(119889) is calculated as

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

where 119868(119896)119898is the value of119898th objective function of 119896th query plan in Front

119894and 119891

max119898

and 119891

min119898

are themaximum and minimum values obtained for the objective function119898Step 4 Perform Selection from PP using binary tournament selection using crowded comparison operator (≺

119899)

Step 5 Perform random single point crossover on selected chromosomes with crossover probability 119875119888

Step 6 Apply mutation on resulting population with mutation probability 119875119898

Let the resulting child population be CPStep 7 Append CP into PP and let the resulting intermediate population be IPStep 8 Repeat Step 1 and Step 2 for population IPStep 9 Form the population PP for the next generation by picking query plans Front-wise from IP till thepopulation size = 119875sizeStep 10 Increment Generation by 1ENDDOReturn Top-119870 Query Plans from population PP

Algorithm 1 NSGA-II based DQPG algorithm

In order to performanondominated sort each query planis compared with every other query plan in the population tofind if it is dominated For each query plan ldquo119894rdquo the followingtwo entities are considered

(i) 119899119894 The number of query plans that dominate the

query plan 119894(ii) 119878119894 The set of query plans that query plan 119894 dominates

All query plans that have 119899119894= 0 are added to the set

1

Set = 1where is called the current front For each

element 119909 in the current front visit each member 119895 in the set119878119909and reduce the count of 119899

119895by 1 Now if 119899

119895gets reduced to

zero for some 119895 add it to the set 2 After evaluating all the

members of in a similar manner set = 2 This process

continues till all the query plans are assigned some frontThe fast nondominated sorting procedure takes the currentpopulation as input and produces a list of nondominatedfronts

119894as output

Step 4 (Density Estimation Using Crowding Distance [446]) After the nondominated sort the crowding distanceis computed for each query plan in

119894 Crowding distance

[4] is an estimate of the density of solutions surrounding aparticular solution point in the population It is defined asthe average distance of the two closest points on either sidesof the given point along each of the objectives The crowdingdistance 119868(119889119894119904119905119886119899119888119890) is computed in the following manner[4 46]

For each front 119894 let 119873 be the number of query plans

in front 119894 Initially the crowding distance for each query

plan in the front 119894is zero That is 119868(119889

119895) = 0 119895 =

1 119873 Next for each objective function the query plansin the front

119894are sorted based on their value of TPC (ie

the first objective function) and similarly also with respectto TCC (ie the second objective function) and placed in119868(119889119894119904119905119886119899119888119890) The query plans having the smallest and thehighest 119868(119889119894119904119905119886119899119888119890) values in both sets are assigned an infinite

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

2 The Scientific World Journal

the overall query processing cost It thus also plays a key rolein determining the overall performance of a DDS There canbe a number of possible ways to process and communicaterelation fragments involved in the query A distributed queryprocessing strategy evaluates all possible sequence of sitescorresponding to relations accessed in the query referredto as a query plan and determines the most optimal queryplan that minimizes the total cost that is local processing(CPU IO) cost and communication cost [5ndash11]The numberof possible query plans grows at least exponentially withthe increase in number of relations accessed by the query[12 13]This number increases further if relations accessed bythe query are replicated across multiple sites Performing anexhaustive search on all possible combinations of query plansis not feasible due to a vast search space Therefore in largeDDS devising a query processing strategy that optimizes thetotal query processing cost is shown to be a combinatorialoptimization problem [10]

Over the last three decades many algorithms and tech-niques have been devised to solve the class of combinatorialoptimization problems Initially the rigorous mathemati-cal and search based techniques like simulated annealingrandom search algorithms dynamic programming and soforth were used to solve such problems which thoughworked well with moderate sized problems on cost heuristiccould not succeed with complex multiobjective problemsThese mechanisms suffered from a drawback at certaininstances where they converged to local optima withoutexploring the entire search space [8 13 14] However inthe last two decades evolutionary techniques have gainedimmense popularity due to their applicability in solving thesecomplex scientific and engineering optimization problemsThese algorithms are inspired by the Darwinian evolutionthat accentuates the concept of ldquoSurvival of the Fittestrdquo [15]It is thus metaphorical to the natural social behavior andbiological evolution of species The evolutionary techniquesare now proved to be themost proficientmethod of choice forsolving such problems Genetic algorithm based techniqueswhich belong to the class of evolutionary algorithms havealso been widely used in solving complex real life science andengineering problemsThe strength of GA as a metaheuristiccomes from its ability to combine the good features fromseveral solutions to create new and better solutions [16 17]over generations

Most real world scientific and engineering problemshave often conflicting and competing objectives that needto be optimized The evolutionary strategies are proved tobe best suited for this class of problems as they can simul-taneously optimize the different objectives and find efficienttradeoffs unlike the classic techniques where the objectiveswere separately optimized and weighed based on the priorknowledge about the problem in hand The first pioneeringstudy on multiobjective evolutionary optimization came outin mideighties [18] In subsequent years several differentevolutionary algorithms (VEGA [19] MOGA [20] NPGA[21] NSGA [22] NSGA-II [4] SPEA [23] SPEA-II [19] PAES[24] PESA [25]) have successfully been implemented to solvethe classic optimization problems for example the singlesource shortest path problem [26] the all-pairs shortest path

problem [27] the multiobjective shortest path problem [28]the travelling salesman problem [29] the knapsack problem[30] and so forth Recently new evolutionary techniques forexample particle swarm optimization [31] artificial immunesystems [32] frog leaping algorithm [33] ant colony opti-mization [34] and so forth have been successfully appliedto the multiobjective optimization paradigm

This paper addresses the distributed query plan gen-eration (DQPG) problem given in [3] This problem isbased on a heuristic that favors query plans involving lessnumber of sites participating to retrieve the results Furtherquery plans involving smaller relations transmitted over lesscostly communication channels would incur less communi-cation costs and are thus favored over others Query plansgenerated based on this heuristic would result in efficientquery processing This DQPG problem was formulated andsolved as a single objective optimization problem in [3]Since this DQPG heuristic comprises minimization of boththe local processing cost and the communication cost anattempt has been made in this paper to minimize these costssimultaneouslyThat is theDQPGproblem is formulated as abiobjective optimization problem comprising two objectivesnamely minimization of the total local processing cost andminimization of the total communication cost In this paperthis problem has been solved using themultiobjective geneticalgorithm NSGA-II (nondominated sorting genetic algo-rithm) [4] The proposed NSGA-II based DQPG algorithmattempts to simultaneously minimize the two objectives withthe aim of achieving an acceptable tradeoff amongst them Itis shown that the optimization of total query processing costusing the proposed algorithm gives considerable improve-ment with respect to the time taken to converge and thequality of solutions with respect to total query processingcost when compared to the single objective GA based DQPGalgorithm given in [3]

This paper is organized as follows Section 2 discussesthe DQPG problem and its solution using the simple geneticalgorithm (SGA) given in [3] Section 3 discusses DQPGusing the multiobjective genetic algorithm An exampleillustrating the use of the proposed NSGA-II based DQPGalgorithm for generating optimal query plans for a distributedquery is given in Section 4The experimental results are givenin Section 5 Section 6 is the conclusion

2 DQPG Using SGA

This paper addresses the DQPG problem given in [3] solvedusing SGATheDQPGproblem is discussed next followed bya brief example describing the underlying methodology

21 The DQPG Problem Query plan generation is a keydeterminant for the efficient processing of a distributed queryThis necessitates devising a query plan generation strategythat would result in efficient query processing This strategywould require minimizing the total cost of query processingThe total cost incurred comprises the joint cost that is the costincurred in processing the query locally at the individual sitesand the cost of communicating the relation fragments among

The Scientific World Journal 3

the sites A distributed query processing strategy is given in[3] which aims to minimize the total query processing cost(TC) given below [3]

TC =

119898

sum

119894=1

LPC119894times 119886119894+

119894=119898minus1 119895=119898

sum

119894=1 119895=119894+1

CC119894119895times 119887119894 (1)

where LPC119894is the local processing cost per byte at site 119894 CC

119894119895

is the communication cost per byte between sites 119894 and 119895119886119894is the bytes to be processed at site 119894 119887

119894is the bytes to be

communicated from site 119894 and119898 is the total number of sitesFor each relation 119877

119905 Card(119877

119905) represents its cardinality and

Size(119877119905) represents the size of a single tuple in bytes At each

site the relations are integrated on common attributes usingthe equijoin operator to arrive at a single relation [3]

For relations 119877119905 with cardinality Card(119877

119905) and 119877

119904with

cardinality Card(119877119904) at site 119894 the cardinality Card

119894of the

resultant relation 119877119894is given as [3]

Card119894=

Card (119877119905) times Card (119877

119904)

Dist119905119904timesmin (Card (119877

119905) Card (119877

119904))

(2)

where Dist119905119904is the number of distinct tuples in the smaller

relation among 119877119905and 119877

119904

The size of the resultant relation119877119894at site 119894 is given as [1 3]

Size119894= Size (119877

119905) + Size (119877

119904) (3)

For a given query plan the communication between sitesoccurs in the order starting from the site having a relationwith lower cardinality to the site having a relation withhigher cardinality [3] The communication cost CC

119894119895and

local processing cost LPC119894are known a priori

The number of bytes to be processed locally at site 119896 isgiven by 119886

119896[3]

119886119896= Card

119894times Size

119894 (4)

The number of bytes to be communicated from site 119895 tosite 119896 is given by 119887

119895[3]

119887119895= Card

119895times Size

119895 (5)

Distributed query plans based on the above heuristic isgenerated using simple GA (SGA) in [3] This SGA basedDQPG as given in [3] is discussed next

22 SGA Based DQPG As discussed above it is a very com-plex task to generate efficient query plans from among a largeset of possible query plans An SGA based DQPG strategybased on the heuristic defined above is given in [3] whichaims to minimize the total cost of query processing (TC)indicating the fitness of a particular solution as compared toothers in the population The algorithm considers relationsaccessed by the query crossover and mutation probabilityand the prespecified number of generations (119866) as inputand produces the Top-119870 query plans as output First thealgorithm randomly generates an initial population of validquery plans (chromosomes) where the size of a query plan

is equal to the number of relations accessed by the queryEach gene in a chromosome represents a relation and theordering of relations in a chromosome is in increasing orderof their cardinality The value of a gene is the site where thecorresponding relation resides As an example for a queryaccessing four relations (1198771 1198772 1198773 and 1198774) arranged in theincreasing order of cardinalities one of the encoding schemesfor the chromosome representation can be (1 1 4 3) implyingthat 1198771 and 1198772 are in site 1 1198773 is in site 4 and 1198774 is insite 3 The fitness (TC) value is computed for each of thequery plans and thereafter the query plans are selected forcrossover using the binary tournament selection technique[35] These selected query plans undergo random single-point crossover [15 36] with probability 119875

119888 and mutation

[15 36] with probability 119875119898 The resultant new population

replaces the old population and the above process is repeatedfor the prespecified number of generations 119866 Thereafter theTop-119870 query plans are produced as output In this paperthe above single objective DQPG problem is formulated andsolved as amultiobjectiveDQPGproblem aswill be discussednext

3 DQPG Using MultiobjectiveGenetic Algorithm

In this paper the single objective DQPG problem discussedabove is formulated as a biobjective DQPG problem Thisformulation is given next

31 Multiobjective DQPG Problem Formulation In the GAbased DQPG algorithm given in [3] there is a single objec-tive that is Minimize TC It can be observed that TC com-prises two costs namely the local processing cost incurredat participating sites that is total processing cost (TPC)and communication cost between the participating sites thatis total communication cost (TCC) Since minimizing TCwould require minimizing TPC and minimizing TCC thissingle objective (Minimize TC)DQPGproblem is formulatedas a biobjective DQPG problem comprising two objectives asMinimize TPC andMinimize TCC Consider

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(6)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894is

the local processing cost per byte at site 119894 119887119894is the bytes to be

communicated from site 119894 and 119886119894is the bytes to be processed

at site 119894 CC119894119895 LPC

119894 119887119894 and 119886

119894are as discussed in Section 21

If a site contains a single relation its LPC is consideredzero TCC and TPC need to be minimized simultaneously toachieve an acceptable tradeoff

4 The Scientific World Journal

The abovemultiobjectiveDQPGproblemhas been solvedusing the multiobjective genetic algorithm which is dis-cussed next

32 Multiobjective Genetic Algorithms Conceptualization ofmultiobjective problems using veridical models has a greatresemblance to many real world engineering and designproblems that involve more than one coextensive and oftencompeting objectives that is maximize profit maximizethroughput minimize cost minimize response time and soforth In such a scenario no single solution can be termedas optimal as in the case of single objective optimizationproblems but rather a set of alternative solutions can bevisualized as a tradeoff between the different objectives underconsideration This set of solutions is regarded superior toothers in the search space as no other recordedavailablesolution can better optimize all the objectives consideredtogether [37ndash39]

Multiobjective optimization approaches can be broadlyclassified into three categories [37] The approaches in thefirst two categories can be termed as the classical optimiza-tion approaches which combine all objectives into a singlecomposite function using some combination of arithmeticoperators ormove all but one objective into the constraint setThe approaches in the first category have limitations in regardto appropriate selection of weights and designing functions inaccordance to the problem It wouldmandate the user to havea priori knowledge of the behavior of each objective functionto some extent for providing the range of values to objectivesso that none of them dominate the others which is not alwayspossible [17] This approach is generally denominated asaggregating functions and it has been implemented at severaloccasionswith relative success in situationswhere behavior ofthe objective function ismore or lesswell-known Someof theaggregating functions include the weighted sum approachgoal programming 120576-constraintmethod and so forth [40] Inthe second approach moving the objectives into a constraintset requires that the boundary values for each of the objectivesbe known a priori which is almost impossible In either of thetwo cases the optimization method returns a single solutionrather than a set of solutions giving possible tradeoffs andtherefore the quality of solution in these approaches greatlydepends upon the correct problem formulation If feasiblethese would be the most efficient and simplest approacheswhich would give atleast sub optimal results in most cases

The third approach overcomes the problems faced inthe classical optimization approaches and emphasizes thedevelopment of alternative techniques based on exploringthe complete set of nondominated solutions and therebyenabling the decision maker to choose among the differentalternatives This set of solutions is referred to as the Paretooptimal set [13] A Pareto optimal set can be formally definedas a set of solutions that are nondominated with respectto each other that is replacing one solution with anotherwithin the Pareto optimal set will invariably lead to a lossto one objective against a gain obtained in another objective[41] Pareto optimal sets can have varied sizes but usuallythe size increases with increase in the number of objectives

[37 40] They are more preferred over single solutionsas they closely resemble real world problems where thedecision maker makes a decision based on tradeoffs betweenmultiple objectives A number of techniques were formulatedto generate the Pareto optimal set for example simulatedannealing [14] Tabu search [42] ant colony optimization[34] and so forth The problem with these algorithms wasthatmost often they get struck at local optima and thus renderit infeasible to venture out for identifying new tradeoffsEvolutionary algorithms such as GA on the other hand seemto be especially suited for this task as they enable parallelexploration of different areas in the search space eventuallyexploiting the solutions attained using operators such ascrossover and mutation [13] It would enable determiningmore members of a Pareto optimal set in a single run insteadof a series of runs required in other blind search strategiesAlso the evolutionary algorithms require very little a prioriknowledge of the problem at hand and therefore are lesssusceptible to the typical shape and continuity of the ParetofrontThe Pareto front can be defined as the points that lie onthe boundary of the Pareto optimal region These algorithmsthus avoid convergence to a suboptimal solution [43]

Mathematically a multiobjective optimization problemwith 119898 decision variables and 119899 objectives can be definedwithout any loss of generality as amaximization orminimiza-tion problem given by [13 38]

minimize = 119891 (119909) = (1198911 (119909) 1198912 (119909) 119891119899 (119909))

where119909 = (1199091 1199092 119909119898) isin 119883

119910 = (1199101 1199102 119910119899) isin 119884

(7)

Here 119909 is the decision vector 119883 refers to the parameterspace 119910 is the objective vector and 119884 defines the objectivespace These objectives may be conflicting in nature that isimprovement in one may lead to deterioration in anotherSo it may become impossible to optimize all objectivessimultaneously in a single solution Instead the best tradeoffsolution would be of interest to a decision maker Thesesolutions form a Pareto optimal set which was initially coinedby Edgeworth and Pareto and is formally defined as [13 38]

ldquoA decision vector 119886 is said to be Pareto optimal if andonly if 119886 is nondominated regarding 119883 A decision vector 119886is said to be nondominated regarding a set 1198831015840 sube 119883 if andonly if there is no vector in 119883

1015840 which dominates 119886 Formallyit can be defined aslt1198861015840 isin 119883

1015840

119886

1015840

≺ 119886rdquoAlso a decision vector 119886 isin 119883 is said to dominate a

decision vector 119887 isin 119883 (also written as 119886 ≺ 119887) if and onlyif

forall119894 isin 1 2 119899 119891119894(119886) le 119891

119894(119887)

forall119895 isin 1 2 119899 119891119895(119886) le 119891

119895(119887)

(8)

Several multiobjective algorithms exist in the literature[4 18ndash25 37 40 41 44 45] of whichGA basedmultiobjectiveoptimization algorithms have been widely used for solvingmultiobjective optimization problems In this paper NSGA-II has been used to solve the DQPG problem NSGA-II willbe discussed next

The Scientific World Journal 5

f(x

2)

f(x1)

minus 1

+ 1

Figure 1 Crowding distance for solution ldquoVrdquo [4]

321 NSGA-II The basis of NSGA-II [4] lies in the non-dominated sorting genetic algorithm (NSGA) introduced bySrinivas and Deb [22] As the name suggests NSGA usesnondominated ranking for each individual in the populationand assigns them accordingly into nondominated frontsTheindividuals in the first front or the nondominated individualsare then assigned large dummy fitness values All individualsin the front shared this fitness value based on a sharing func-tion Next the individuals in the second nondominated frontare considered and similarly assigned a dummy fitness lowerthan the fitness assigned in the previous front This processcontinues till the entire population is classified into frontsSince the solutions in the first front have themaximumfitnessvalue their chances of selection increase and eventually morecopies of such solutions get passed on to the next generationHowever NSGA suffered from some drawbacks such as highcomputational complexity119874(119898119873

3

) nonelitist approach andthe requirement of specifying a shared parameter [4] Theselimitations were addressed in NSGA-II proposed by Deb etal [4] as an improved version of NSGA [22] It alleviatesthe drawbacks in NSGA by reducing the computationalcomplexity to 119874(119898119873

2

) Further it uses a parameter-lesssharing approach by using a crowding distance measurefor selection The crowding distance is an estimate of thedensity of solutions surrounding a particular solution in theobjective space In Figure 1 the crowding distance of solutionrepresented as point V is computed as the average distancebetween the two closest solutions represented as points V minus 1

and V + 1 on either side of the points V along each of theobjectives 119891(1199091) and 119891(1199092)

NSGA-II uses a crowded-comparison operator for selec-tion which takes into account both the nondomination rankof a query plan in the population and its crowding distanceThe nondominated solutions are preferred over dominatedsolutions and between two solutions having the same ranka solution that resides in the less crowded region is preferredthat is a solution for which the crowding distance is higherTheNSGA-II does not use any externalmemory but it ensureselitism by combining the best parents with the best offspringobtained [19] In this paper an NSGA-II based multiobjective

DQPG algorithm is used to compute optimal query plansfor a given distributed query This algorithm is discussednext

33 NSGA-II Based DQPG Algorithm The proposed NSGA-II based DQPG algorithm takes the relations given in theFROM clause of the distributed query as input It arrangesthese relations in increasing order of their cardinalities Itthen generates a fixed set of feasible query plans (chromo-somes) based on the possible combinations of sites in whichthese relations are residing Each gene in a chromosomerepresents a relation and is arranged in increasing orderof the corresponding relationrsquos cardinality The value of agene represents the site in which the corresponding relationresides For example suppose that a query posed by the userhas 4 relations (1198771 1198772 1198773 and 1198774) arranged in ascendingorderThe relation 1198771 is stored in sites 1198781 and 1198783 1198772 is storedin 11987811198773 is stored in 1198781 and 1198782 and1198774 is stored in 1198781Then theinitial population of feasible query plans (chromosomes) canbe (1 1 1 1) (3 1 1 1) (3 1 2 1) and (1 1 2 1) This definesthe encoding scheme for the given problem The proposedDQPG algorithm based on NSGA-II is given in Algorithm 1The steps involved in this algorithm are discussed as follows

Step 1 (Initialize the Population [4 46]) A random popula-tion of query plans is generated as per the encoding schemediscussed above

Step 2 (EvaluateQuery Plans on theObjective Functions)For each of the query plans in the population the TCC andTPC values are computed as given below

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(9)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894

is the local processing cost per byte at site 119894 119887119894is the bytes

to be communicated from site 119894 and 119886119894is the bytes to be

processed at site 119894 The procedure to compute CC119894119895 LPC

119894

119887119894 and 119886

119894is given in Section 21 If a site contains a single

relation its LPC is considered zero TCC and TPC needto be minimized simultaneously to achieve an acceptabletradeoff

Step 3 (Perform Nondominated Sort [4 46]) On the givenpopulation a fast nondominated sorting is performed in thefollowing manner

Twoobjective functions are consideredThefirst objectiveis to minimize the total processing cost (TPC) and thesecond objective is to minimize the total communicationcost (TCC) NSGA-II attempts to find a tradeoff betweenthese two objectives that can result in minimum total queryprocessing cost (TC)

6 The Scientific World Journal

Input 119877119894 Relations participating in the query 119875

119888 Probability of crossover

119875119898 Probability of mutation 119866 Pre-defined number of generations 119875size Population Size

Output Top-n query plansMethodInitialize a random parent population of query plans PPwhere chromosome length ldquolenrdquo is the number of relations accessed in the query and the gene at the 119896th position in thechromosome represents the site of the 119896th relationWHILE generation le 119866 DOStep 1 Evaluate each query plan in PP on the following objective functions

1198981 Minimize TCC =

119894=119906minus1119895=119906

sum

119894=1119895=119894+1

CC119894119895times 119887119894

1198982 Minimize TPC =

119906

sum

119894=1

LPC119894times 119886119894

where 119906 is the number of sites accessed by the query plan in ascending order of cardinality per site CC119894119895is the communication

cost per byte between sites 119894 and 119895 LPC119894is the local processing cost per byte at site 119894 119887

119894is the bytes to be communicated from

site 119894 and 119886119894is the bytes to be processed at site 119894

Step 2 Perform Non-Dominated (ND) Sort on PP for ldquo1198981rdquo and ldquo119898

2rdquo separately and place each query plan (QP)

into corresponding ND fronts ldquo119894rdquo and sort the QPs within each ldquo

119894rdquo

Step 3 Evaluate Crowding Distance Function 119868(119889) for each objective functionAssign 119868(119889) = infin for smallest and highest values in each front ldquo

119894rdquo

For the remaining QPs 119868(119889) is calculated as

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

where 119868(119896)119898is the value of119898th objective function of 119896th query plan in Front

119894and 119891

max119898

and 119891

min119898

are themaximum and minimum values obtained for the objective function119898Step 4 Perform Selection from PP using binary tournament selection using crowded comparison operator (≺

119899)

Step 5 Perform random single point crossover on selected chromosomes with crossover probability 119875119888

Step 6 Apply mutation on resulting population with mutation probability 119875119898

Let the resulting child population be CPStep 7 Append CP into PP and let the resulting intermediate population be IPStep 8 Repeat Step 1 and Step 2 for population IPStep 9 Form the population PP for the next generation by picking query plans Front-wise from IP till thepopulation size = 119875sizeStep 10 Increment Generation by 1ENDDOReturn Top-119870 Query Plans from population PP

Algorithm 1 NSGA-II based DQPG algorithm

In order to performanondominated sort each query planis compared with every other query plan in the population tofind if it is dominated For each query plan ldquo119894rdquo the followingtwo entities are considered

(i) 119899119894 The number of query plans that dominate the

query plan 119894(ii) 119878119894 The set of query plans that query plan 119894 dominates

All query plans that have 119899119894= 0 are added to the set

1

Set = 1where is called the current front For each

element 119909 in the current front visit each member 119895 in the set119878119909and reduce the count of 119899

119895by 1 Now if 119899

119895gets reduced to

zero for some 119895 add it to the set 2 After evaluating all the

members of in a similar manner set = 2 This process

continues till all the query plans are assigned some frontThe fast nondominated sorting procedure takes the currentpopulation as input and produces a list of nondominatedfronts

119894as output

Step 4 (Density Estimation Using Crowding Distance [446]) After the nondominated sort the crowding distanceis computed for each query plan in

119894 Crowding distance

[4] is an estimate of the density of solutions surrounding aparticular solution point in the population It is defined asthe average distance of the two closest points on either sidesof the given point along each of the objectives The crowdingdistance 119868(119889119894119904119905119886119899119888119890) is computed in the following manner[4 46]

For each front 119894 let 119873 be the number of query plans

in front 119894 Initially the crowding distance for each query

plan in the front 119894is zero That is 119868(119889

119895) = 0 119895 =

1 119873 Next for each objective function the query plansin the front

119894are sorted based on their value of TPC (ie

the first objective function) and similarly also with respectto TCC (ie the second objective function) and placed in119868(119889119894119904119905119886119899119888119890) The query plans having the smallest and thehighest 119868(119889119894119904119905119886119899119888119890) values in both sets are assigned an infinite

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 3

the sites A distributed query processing strategy is given in[3] which aims to minimize the total query processing cost(TC) given below [3]

TC =

119898

sum

119894=1

LPC119894times 119886119894+

119894=119898minus1 119895=119898

sum

119894=1 119895=119894+1

CC119894119895times 119887119894 (1)

where LPC119894is the local processing cost per byte at site 119894 CC

119894119895

is the communication cost per byte between sites 119894 and 119895119886119894is the bytes to be processed at site 119894 119887

119894is the bytes to be

communicated from site 119894 and119898 is the total number of sitesFor each relation 119877

119905 Card(119877

119905) represents its cardinality and

Size(119877119905) represents the size of a single tuple in bytes At each

site the relations are integrated on common attributes usingthe equijoin operator to arrive at a single relation [3]

For relations 119877119905 with cardinality Card(119877

119905) and 119877

119904with

cardinality Card(119877119904) at site 119894 the cardinality Card

119894of the

resultant relation 119877119894is given as [3]

Card119894=

Card (119877119905) times Card (119877

119904)

Dist119905119904timesmin (Card (119877

119905) Card (119877

119904))

(2)

where Dist119905119904is the number of distinct tuples in the smaller

relation among 119877119905and 119877

119904

The size of the resultant relation119877119894at site 119894 is given as [1 3]

Size119894= Size (119877

119905) + Size (119877

119904) (3)

For a given query plan the communication between sitesoccurs in the order starting from the site having a relationwith lower cardinality to the site having a relation withhigher cardinality [3] The communication cost CC

119894119895and

local processing cost LPC119894are known a priori

The number of bytes to be processed locally at site 119896 isgiven by 119886

119896[3]

119886119896= Card

119894times Size

119894 (4)

The number of bytes to be communicated from site 119895 tosite 119896 is given by 119887

119895[3]

119887119895= Card

119895times Size

119895 (5)

Distributed query plans based on the above heuristic isgenerated using simple GA (SGA) in [3] This SGA basedDQPG as given in [3] is discussed next

22 SGA Based DQPG As discussed above it is a very com-plex task to generate efficient query plans from among a largeset of possible query plans An SGA based DQPG strategybased on the heuristic defined above is given in [3] whichaims to minimize the total cost of query processing (TC)indicating the fitness of a particular solution as compared toothers in the population The algorithm considers relationsaccessed by the query crossover and mutation probabilityand the prespecified number of generations (119866) as inputand produces the Top-119870 query plans as output First thealgorithm randomly generates an initial population of validquery plans (chromosomes) where the size of a query plan

is equal to the number of relations accessed by the queryEach gene in a chromosome represents a relation and theordering of relations in a chromosome is in increasing orderof their cardinality The value of a gene is the site where thecorresponding relation resides As an example for a queryaccessing four relations (1198771 1198772 1198773 and 1198774) arranged in theincreasing order of cardinalities one of the encoding schemesfor the chromosome representation can be (1 1 4 3) implyingthat 1198771 and 1198772 are in site 1 1198773 is in site 4 and 1198774 is insite 3 The fitness (TC) value is computed for each of thequery plans and thereafter the query plans are selected forcrossover using the binary tournament selection technique[35] These selected query plans undergo random single-point crossover [15 36] with probability 119875

119888 and mutation

[15 36] with probability 119875119898 The resultant new population

replaces the old population and the above process is repeatedfor the prespecified number of generations 119866 Thereafter theTop-119870 query plans are produced as output In this paperthe above single objective DQPG problem is formulated andsolved as amultiobjectiveDQPGproblem aswill be discussednext

3 DQPG Using MultiobjectiveGenetic Algorithm

In this paper the single objective DQPG problem discussedabove is formulated as a biobjective DQPG problem Thisformulation is given next

31 Multiobjective DQPG Problem Formulation In the GAbased DQPG algorithm given in [3] there is a single objec-tive that is Minimize TC It can be observed that TC com-prises two costs namely the local processing cost incurredat participating sites that is total processing cost (TPC)and communication cost between the participating sites thatis total communication cost (TCC) Since minimizing TCwould require minimizing TPC and minimizing TCC thissingle objective (Minimize TC)DQPGproblem is formulatedas a biobjective DQPG problem comprising two objectives asMinimize TPC andMinimize TCC Consider

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(6)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894is

the local processing cost per byte at site 119894 119887119894is the bytes to be

communicated from site 119894 and 119886119894is the bytes to be processed

at site 119894 CC119894119895 LPC

119894 119887119894 and 119886

119894are as discussed in Section 21

If a site contains a single relation its LPC is consideredzero TCC and TPC need to be minimized simultaneously toachieve an acceptable tradeoff

4 The Scientific World Journal

The abovemultiobjectiveDQPGproblemhas been solvedusing the multiobjective genetic algorithm which is dis-cussed next

32 Multiobjective Genetic Algorithms Conceptualization ofmultiobjective problems using veridical models has a greatresemblance to many real world engineering and designproblems that involve more than one coextensive and oftencompeting objectives that is maximize profit maximizethroughput minimize cost minimize response time and soforth In such a scenario no single solution can be termedas optimal as in the case of single objective optimizationproblems but rather a set of alternative solutions can bevisualized as a tradeoff between the different objectives underconsideration This set of solutions is regarded superior toothers in the search space as no other recordedavailablesolution can better optimize all the objectives consideredtogether [37ndash39]

Multiobjective optimization approaches can be broadlyclassified into three categories [37] The approaches in thefirst two categories can be termed as the classical optimiza-tion approaches which combine all objectives into a singlecomposite function using some combination of arithmeticoperators ormove all but one objective into the constraint setThe approaches in the first category have limitations in regardto appropriate selection of weights and designing functions inaccordance to the problem It wouldmandate the user to havea priori knowledge of the behavior of each objective functionto some extent for providing the range of values to objectivesso that none of them dominate the others which is not alwayspossible [17] This approach is generally denominated asaggregating functions and it has been implemented at severaloccasionswith relative success in situationswhere behavior ofthe objective function ismore or lesswell-known Someof theaggregating functions include the weighted sum approachgoal programming 120576-constraintmethod and so forth [40] Inthe second approach moving the objectives into a constraintset requires that the boundary values for each of the objectivesbe known a priori which is almost impossible In either of thetwo cases the optimization method returns a single solutionrather than a set of solutions giving possible tradeoffs andtherefore the quality of solution in these approaches greatlydepends upon the correct problem formulation If feasiblethese would be the most efficient and simplest approacheswhich would give atleast sub optimal results in most cases

The third approach overcomes the problems faced inthe classical optimization approaches and emphasizes thedevelopment of alternative techniques based on exploringthe complete set of nondominated solutions and therebyenabling the decision maker to choose among the differentalternatives This set of solutions is referred to as the Paretooptimal set [13] A Pareto optimal set can be formally definedas a set of solutions that are nondominated with respectto each other that is replacing one solution with anotherwithin the Pareto optimal set will invariably lead to a lossto one objective against a gain obtained in another objective[41] Pareto optimal sets can have varied sizes but usuallythe size increases with increase in the number of objectives

[37 40] They are more preferred over single solutionsas they closely resemble real world problems where thedecision maker makes a decision based on tradeoffs betweenmultiple objectives A number of techniques were formulatedto generate the Pareto optimal set for example simulatedannealing [14] Tabu search [42] ant colony optimization[34] and so forth The problem with these algorithms wasthatmost often they get struck at local optima and thus renderit infeasible to venture out for identifying new tradeoffsEvolutionary algorithms such as GA on the other hand seemto be especially suited for this task as they enable parallelexploration of different areas in the search space eventuallyexploiting the solutions attained using operators such ascrossover and mutation [13] It would enable determiningmore members of a Pareto optimal set in a single run insteadof a series of runs required in other blind search strategiesAlso the evolutionary algorithms require very little a prioriknowledge of the problem at hand and therefore are lesssusceptible to the typical shape and continuity of the ParetofrontThe Pareto front can be defined as the points that lie onthe boundary of the Pareto optimal region These algorithmsthus avoid convergence to a suboptimal solution [43]

Mathematically a multiobjective optimization problemwith 119898 decision variables and 119899 objectives can be definedwithout any loss of generality as amaximization orminimiza-tion problem given by [13 38]

minimize = 119891 (119909) = (1198911 (119909) 1198912 (119909) 119891119899 (119909))

where119909 = (1199091 1199092 119909119898) isin 119883

119910 = (1199101 1199102 119910119899) isin 119884

(7)

Here 119909 is the decision vector 119883 refers to the parameterspace 119910 is the objective vector and 119884 defines the objectivespace These objectives may be conflicting in nature that isimprovement in one may lead to deterioration in anotherSo it may become impossible to optimize all objectivessimultaneously in a single solution Instead the best tradeoffsolution would be of interest to a decision maker Thesesolutions form a Pareto optimal set which was initially coinedby Edgeworth and Pareto and is formally defined as [13 38]

ldquoA decision vector 119886 is said to be Pareto optimal if andonly if 119886 is nondominated regarding 119883 A decision vector 119886is said to be nondominated regarding a set 1198831015840 sube 119883 if andonly if there is no vector in 119883

1015840 which dominates 119886 Formallyit can be defined aslt1198861015840 isin 119883

1015840

119886

1015840

≺ 119886rdquoAlso a decision vector 119886 isin 119883 is said to dominate a

decision vector 119887 isin 119883 (also written as 119886 ≺ 119887) if and onlyif

forall119894 isin 1 2 119899 119891119894(119886) le 119891

119894(119887)

forall119895 isin 1 2 119899 119891119895(119886) le 119891

119895(119887)

(8)

Several multiobjective algorithms exist in the literature[4 18ndash25 37 40 41 44 45] of whichGA basedmultiobjectiveoptimization algorithms have been widely used for solvingmultiobjective optimization problems In this paper NSGA-II has been used to solve the DQPG problem NSGA-II willbe discussed next

The Scientific World Journal 5

f(x

2)

f(x1)

minus 1

+ 1

Figure 1 Crowding distance for solution ldquoVrdquo [4]

321 NSGA-II The basis of NSGA-II [4] lies in the non-dominated sorting genetic algorithm (NSGA) introduced bySrinivas and Deb [22] As the name suggests NSGA usesnondominated ranking for each individual in the populationand assigns them accordingly into nondominated frontsTheindividuals in the first front or the nondominated individualsare then assigned large dummy fitness values All individualsin the front shared this fitness value based on a sharing func-tion Next the individuals in the second nondominated frontare considered and similarly assigned a dummy fitness lowerthan the fitness assigned in the previous front This processcontinues till the entire population is classified into frontsSince the solutions in the first front have themaximumfitnessvalue their chances of selection increase and eventually morecopies of such solutions get passed on to the next generationHowever NSGA suffered from some drawbacks such as highcomputational complexity119874(119898119873

3

) nonelitist approach andthe requirement of specifying a shared parameter [4] Theselimitations were addressed in NSGA-II proposed by Deb etal [4] as an improved version of NSGA [22] It alleviatesthe drawbacks in NSGA by reducing the computationalcomplexity to 119874(119898119873

2

) Further it uses a parameter-lesssharing approach by using a crowding distance measurefor selection The crowding distance is an estimate of thedensity of solutions surrounding a particular solution in theobjective space In Figure 1 the crowding distance of solutionrepresented as point V is computed as the average distancebetween the two closest solutions represented as points V minus 1

and V + 1 on either side of the points V along each of theobjectives 119891(1199091) and 119891(1199092)

NSGA-II uses a crowded-comparison operator for selec-tion which takes into account both the nondomination rankof a query plan in the population and its crowding distanceThe nondominated solutions are preferred over dominatedsolutions and between two solutions having the same ranka solution that resides in the less crowded region is preferredthat is a solution for which the crowding distance is higherTheNSGA-II does not use any externalmemory but it ensureselitism by combining the best parents with the best offspringobtained [19] In this paper an NSGA-II based multiobjective

DQPG algorithm is used to compute optimal query plansfor a given distributed query This algorithm is discussednext

33 NSGA-II Based DQPG Algorithm The proposed NSGA-II based DQPG algorithm takes the relations given in theFROM clause of the distributed query as input It arrangesthese relations in increasing order of their cardinalities Itthen generates a fixed set of feasible query plans (chromo-somes) based on the possible combinations of sites in whichthese relations are residing Each gene in a chromosomerepresents a relation and is arranged in increasing orderof the corresponding relationrsquos cardinality The value of agene represents the site in which the corresponding relationresides For example suppose that a query posed by the userhas 4 relations (1198771 1198772 1198773 and 1198774) arranged in ascendingorderThe relation 1198771 is stored in sites 1198781 and 1198783 1198772 is storedin 11987811198773 is stored in 1198781 and 1198782 and1198774 is stored in 1198781Then theinitial population of feasible query plans (chromosomes) canbe (1 1 1 1) (3 1 1 1) (3 1 2 1) and (1 1 2 1) This definesthe encoding scheme for the given problem The proposedDQPG algorithm based on NSGA-II is given in Algorithm 1The steps involved in this algorithm are discussed as follows

Step 1 (Initialize the Population [4 46]) A random popula-tion of query plans is generated as per the encoding schemediscussed above

Step 2 (EvaluateQuery Plans on theObjective Functions)For each of the query plans in the population the TCC andTPC values are computed as given below

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(9)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894

is the local processing cost per byte at site 119894 119887119894is the bytes

to be communicated from site 119894 and 119886119894is the bytes to be

processed at site 119894 The procedure to compute CC119894119895 LPC

119894

119887119894 and 119886

119894is given in Section 21 If a site contains a single

relation its LPC is considered zero TCC and TPC needto be minimized simultaneously to achieve an acceptabletradeoff

Step 3 (Perform Nondominated Sort [4 46]) On the givenpopulation a fast nondominated sorting is performed in thefollowing manner

Twoobjective functions are consideredThefirst objectiveis to minimize the total processing cost (TPC) and thesecond objective is to minimize the total communicationcost (TCC) NSGA-II attempts to find a tradeoff betweenthese two objectives that can result in minimum total queryprocessing cost (TC)

6 The Scientific World Journal

Input 119877119894 Relations participating in the query 119875

119888 Probability of crossover

119875119898 Probability of mutation 119866 Pre-defined number of generations 119875size Population Size

Output Top-n query plansMethodInitialize a random parent population of query plans PPwhere chromosome length ldquolenrdquo is the number of relations accessed in the query and the gene at the 119896th position in thechromosome represents the site of the 119896th relationWHILE generation le 119866 DOStep 1 Evaluate each query plan in PP on the following objective functions

1198981 Minimize TCC =

119894=119906minus1119895=119906

sum

119894=1119895=119894+1

CC119894119895times 119887119894

1198982 Minimize TPC =

119906

sum

119894=1

LPC119894times 119886119894

where 119906 is the number of sites accessed by the query plan in ascending order of cardinality per site CC119894119895is the communication

cost per byte between sites 119894 and 119895 LPC119894is the local processing cost per byte at site 119894 119887

119894is the bytes to be communicated from

site 119894 and 119886119894is the bytes to be processed at site 119894

Step 2 Perform Non-Dominated (ND) Sort on PP for ldquo1198981rdquo and ldquo119898

2rdquo separately and place each query plan (QP)

into corresponding ND fronts ldquo119894rdquo and sort the QPs within each ldquo

119894rdquo

Step 3 Evaluate Crowding Distance Function 119868(119889) for each objective functionAssign 119868(119889) = infin for smallest and highest values in each front ldquo

119894rdquo

For the remaining QPs 119868(119889) is calculated as

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

where 119868(119896)119898is the value of119898th objective function of 119896th query plan in Front

119894and 119891

max119898

and 119891

min119898

are themaximum and minimum values obtained for the objective function119898Step 4 Perform Selection from PP using binary tournament selection using crowded comparison operator (≺

119899)

Step 5 Perform random single point crossover on selected chromosomes with crossover probability 119875119888

Step 6 Apply mutation on resulting population with mutation probability 119875119898

Let the resulting child population be CPStep 7 Append CP into PP and let the resulting intermediate population be IPStep 8 Repeat Step 1 and Step 2 for population IPStep 9 Form the population PP for the next generation by picking query plans Front-wise from IP till thepopulation size = 119875sizeStep 10 Increment Generation by 1ENDDOReturn Top-119870 Query Plans from population PP

Algorithm 1 NSGA-II based DQPG algorithm

In order to performanondominated sort each query planis compared with every other query plan in the population tofind if it is dominated For each query plan ldquo119894rdquo the followingtwo entities are considered

(i) 119899119894 The number of query plans that dominate the

query plan 119894(ii) 119878119894 The set of query plans that query plan 119894 dominates

All query plans that have 119899119894= 0 are added to the set

1

Set = 1where is called the current front For each

element 119909 in the current front visit each member 119895 in the set119878119909and reduce the count of 119899

119895by 1 Now if 119899

119895gets reduced to

zero for some 119895 add it to the set 2 After evaluating all the

members of in a similar manner set = 2 This process

continues till all the query plans are assigned some frontThe fast nondominated sorting procedure takes the currentpopulation as input and produces a list of nondominatedfronts

119894as output

Step 4 (Density Estimation Using Crowding Distance [446]) After the nondominated sort the crowding distanceis computed for each query plan in

119894 Crowding distance

[4] is an estimate of the density of solutions surrounding aparticular solution point in the population It is defined asthe average distance of the two closest points on either sidesof the given point along each of the objectives The crowdingdistance 119868(119889119894119904119905119886119899119888119890) is computed in the following manner[4 46]

For each front 119894 let 119873 be the number of query plans

in front 119894 Initially the crowding distance for each query

plan in the front 119894is zero That is 119868(119889

119895) = 0 119895 =

1 119873 Next for each objective function the query plansin the front

119894are sorted based on their value of TPC (ie

the first objective function) and similarly also with respectto TCC (ie the second objective function) and placed in119868(119889119894119904119905119886119899119888119890) The query plans having the smallest and thehighest 119868(119889119894119904119905119886119899119888119890) values in both sets are assigned an infinite

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

4 The Scientific World Journal

The abovemultiobjectiveDQPGproblemhas been solvedusing the multiobjective genetic algorithm which is dis-cussed next

32 Multiobjective Genetic Algorithms Conceptualization ofmultiobjective problems using veridical models has a greatresemblance to many real world engineering and designproblems that involve more than one coextensive and oftencompeting objectives that is maximize profit maximizethroughput minimize cost minimize response time and soforth In such a scenario no single solution can be termedas optimal as in the case of single objective optimizationproblems but rather a set of alternative solutions can bevisualized as a tradeoff between the different objectives underconsideration This set of solutions is regarded superior toothers in the search space as no other recordedavailablesolution can better optimize all the objectives consideredtogether [37ndash39]

Multiobjective optimization approaches can be broadlyclassified into three categories [37] The approaches in thefirst two categories can be termed as the classical optimiza-tion approaches which combine all objectives into a singlecomposite function using some combination of arithmeticoperators ormove all but one objective into the constraint setThe approaches in the first category have limitations in regardto appropriate selection of weights and designing functions inaccordance to the problem It wouldmandate the user to havea priori knowledge of the behavior of each objective functionto some extent for providing the range of values to objectivesso that none of them dominate the others which is not alwayspossible [17] This approach is generally denominated asaggregating functions and it has been implemented at severaloccasionswith relative success in situationswhere behavior ofthe objective function ismore or lesswell-known Someof theaggregating functions include the weighted sum approachgoal programming 120576-constraintmethod and so forth [40] Inthe second approach moving the objectives into a constraintset requires that the boundary values for each of the objectivesbe known a priori which is almost impossible In either of thetwo cases the optimization method returns a single solutionrather than a set of solutions giving possible tradeoffs andtherefore the quality of solution in these approaches greatlydepends upon the correct problem formulation If feasiblethese would be the most efficient and simplest approacheswhich would give atleast sub optimal results in most cases

The third approach overcomes the problems faced inthe classical optimization approaches and emphasizes thedevelopment of alternative techniques based on exploringthe complete set of nondominated solutions and therebyenabling the decision maker to choose among the differentalternatives This set of solutions is referred to as the Paretooptimal set [13] A Pareto optimal set can be formally definedas a set of solutions that are nondominated with respectto each other that is replacing one solution with anotherwithin the Pareto optimal set will invariably lead to a lossto one objective against a gain obtained in another objective[41] Pareto optimal sets can have varied sizes but usuallythe size increases with increase in the number of objectives

[37 40] They are more preferred over single solutionsas they closely resemble real world problems where thedecision maker makes a decision based on tradeoffs betweenmultiple objectives A number of techniques were formulatedto generate the Pareto optimal set for example simulatedannealing [14] Tabu search [42] ant colony optimization[34] and so forth The problem with these algorithms wasthatmost often they get struck at local optima and thus renderit infeasible to venture out for identifying new tradeoffsEvolutionary algorithms such as GA on the other hand seemto be especially suited for this task as they enable parallelexploration of different areas in the search space eventuallyexploiting the solutions attained using operators such ascrossover and mutation [13] It would enable determiningmore members of a Pareto optimal set in a single run insteadof a series of runs required in other blind search strategiesAlso the evolutionary algorithms require very little a prioriknowledge of the problem at hand and therefore are lesssusceptible to the typical shape and continuity of the ParetofrontThe Pareto front can be defined as the points that lie onthe boundary of the Pareto optimal region These algorithmsthus avoid convergence to a suboptimal solution [43]

Mathematically a multiobjective optimization problemwith 119898 decision variables and 119899 objectives can be definedwithout any loss of generality as amaximization orminimiza-tion problem given by [13 38]

minimize = 119891 (119909) = (1198911 (119909) 1198912 (119909) 119891119899 (119909))

where119909 = (1199091 1199092 119909119898) isin 119883

119910 = (1199101 1199102 119910119899) isin 119884

(7)

Here 119909 is the decision vector 119883 refers to the parameterspace 119910 is the objective vector and 119884 defines the objectivespace These objectives may be conflicting in nature that isimprovement in one may lead to deterioration in anotherSo it may become impossible to optimize all objectivessimultaneously in a single solution Instead the best tradeoffsolution would be of interest to a decision maker Thesesolutions form a Pareto optimal set which was initially coinedby Edgeworth and Pareto and is formally defined as [13 38]

ldquoA decision vector 119886 is said to be Pareto optimal if andonly if 119886 is nondominated regarding 119883 A decision vector 119886is said to be nondominated regarding a set 1198831015840 sube 119883 if andonly if there is no vector in 119883

1015840 which dominates 119886 Formallyit can be defined aslt1198861015840 isin 119883

1015840

119886

1015840

≺ 119886rdquoAlso a decision vector 119886 isin 119883 is said to dominate a

decision vector 119887 isin 119883 (also written as 119886 ≺ 119887) if and onlyif

forall119894 isin 1 2 119899 119891119894(119886) le 119891

119894(119887)

forall119895 isin 1 2 119899 119891119895(119886) le 119891

119895(119887)

(8)

Several multiobjective algorithms exist in the literature[4 18ndash25 37 40 41 44 45] of whichGA basedmultiobjectiveoptimization algorithms have been widely used for solvingmultiobjective optimization problems In this paper NSGA-II has been used to solve the DQPG problem NSGA-II willbe discussed next

The Scientific World Journal 5

f(x

2)

f(x1)

minus 1

+ 1

Figure 1 Crowding distance for solution ldquoVrdquo [4]

321 NSGA-II The basis of NSGA-II [4] lies in the non-dominated sorting genetic algorithm (NSGA) introduced bySrinivas and Deb [22] As the name suggests NSGA usesnondominated ranking for each individual in the populationand assigns them accordingly into nondominated frontsTheindividuals in the first front or the nondominated individualsare then assigned large dummy fitness values All individualsin the front shared this fitness value based on a sharing func-tion Next the individuals in the second nondominated frontare considered and similarly assigned a dummy fitness lowerthan the fitness assigned in the previous front This processcontinues till the entire population is classified into frontsSince the solutions in the first front have themaximumfitnessvalue their chances of selection increase and eventually morecopies of such solutions get passed on to the next generationHowever NSGA suffered from some drawbacks such as highcomputational complexity119874(119898119873

3

) nonelitist approach andthe requirement of specifying a shared parameter [4] Theselimitations were addressed in NSGA-II proposed by Deb etal [4] as an improved version of NSGA [22] It alleviatesthe drawbacks in NSGA by reducing the computationalcomplexity to 119874(119898119873

2

) Further it uses a parameter-lesssharing approach by using a crowding distance measurefor selection The crowding distance is an estimate of thedensity of solutions surrounding a particular solution in theobjective space In Figure 1 the crowding distance of solutionrepresented as point V is computed as the average distancebetween the two closest solutions represented as points V minus 1

and V + 1 on either side of the points V along each of theobjectives 119891(1199091) and 119891(1199092)

NSGA-II uses a crowded-comparison operator for selec-tion which takes into account both the nondomination rankof a query plan in the population and its crowding distanceThe nondominated solutions are preferred over dominatedsolutions and between two solutions having the same ranka solution that resides in the less crowded region is preferredthat is a solution for which the crowding distance is higherTheNSGA-II does not use any externalmemory but it ensureselitism by combining the best parents with the best offspringobtained [19] In this paper an NSGA-II based multiobjective

DQPG algorithm is used to compute optimal query plansfor a given distributed query This algorithm is discussednext

33 NSGA-II Based DQPG Algorithm The proposed NSGA-II based DQPG algorithm takes the relations given in theFROM clause of the distributed query as input It arrangesthese relations in increasing order of their cardinalities Itthen generates a fixed set of feasible query plans (chromo-somes) based on the possible combinations of sites in whichthese relations are residing Each gene in a chromosomerepresents a relation and is arranged in increasing orderof the corresponding relationrsquos cardinality The value of agene represents the site in which the corresponding relationresides For example suppose that a query posed by the userhas 4 relations (1198771 1198772 1198773 and 1198774) arranged in ascendingorderThe relation 1198771 is stored in sites 1198781 and 1198783 1198772 is storedin 11987811198773 is stored in 1198781 and 1198782 and1198774 is stored in 1198781Then theinitial population of feasible query plans (chromosomes) canbe (1 1 1 1) (3 1 1 1) (3 1 2 1) and (1 1 2 1) This definesthe encoding scheme for the given problem The proposedDQPG algorithm based on NSGA-II is given in Algorithm 1The steps involved in this algorithm are discussed as follows

Step 1 (Initialize the Population [4 46]) A random popula-tion of query plans is generated as per the encoding schemediscussed above

Step 2 (EvaluateQuery Plans on theObjective Functions)For each of the query plans in the population the TCC andTPC values are computed as given below

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(9)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894

is the local processing cost per byte at site 119894 119887119894is the bytes

to be communicated from site 119894 and 119886119894is the bytes to be

processed at site 119894 The procedure to compute CC119894119895 LPC

119894

119887119894 and 119886

119894is given in Section 21 If a site contains a single

relation its LPC is considered zero TCC and TPC needto be minimized simultaneously to achieve an acceptabletradeoff

Step 3 (Perform Nondominated Sort [4 46]) On the givenpopulation a fast nondominated sorting is performed in thefollowing manner

Twoobjective functions are consideredThefirst objectiveis to minimize the total processing cost (TPC) and thesecond objective is to minimize the total communicationcost (TCC) NSGA-II attempts to find a tradeoff betweenthese two objectives that can result in minimum total queryprocessing cost (TC)

6 The Scientific World Journal

Input 119877119894 Relations participating in the query 119875

119888 Probability of crossover

119875119898 Probability of mutation 119866 Pre-defined number of generations 119875size Population Size

Output Top-n query plansMethodInitialize a random parent population of query plans PPwhere chromosome length ldquolenrdquo is the number of relations accessed in the query and the gene at the 119896th position in thechromosome represents the site of the 119896th relationWHILE generation le 119866 DOStep 1 Evaluate each query plan in PP on the following objective functions

1198981 Minimize TCC =

119894=119906minus1119895=119906

sum

119894=1119895=119894+1

CC119894119895times 119887119894

1198982 Minimize TPC =

119906

sum

119894=1

LPC119894times 119886119894

where 119906 is the number of sites accessed by the query plan in ascending order of cardinality per site CC119894119895is the communication

cost per byte between sites 119894 and 119895 LPC119894is the local processing cost per byte at site 119894 119887

119894is the bytes to be communicated from

site 119894 and 119886119894is the bytes to be processed at site 119894

Step 2 Perform Non-Dominated (ND) Sort on PP for ldquo1198981rdquo and ldquo119898

2rdquo separately and place each query plan (QP)

into corresponding ND fronts ldquo119894rdquo and sort the QPs within each ldquo

119894rdquo

Step 3 Evaluate Crowding Distance Function 119868(119889) for each objective functionAssign 119868(119889) = infin for smallest and highest values in each front ldquo

119894rdquo

For the remaining QPs 119868(119889) is calculated as

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

where 119868(119896)119898is the value of119898th objective function of 119896th query plan in Front

119894and 119891

max119898

and 119891

min119898

are themaximum and minimum values obtained for the objective function119898Step 4 Perform Selection from PP using binary tournament selection using crowded comparison operator (≺

119899)

Step 5 Perform random single point crossover on selected chromosomes with crossover probability 119875119888

Step 6 Apply mutation on resulting population with mutation probability 119875119898

Let the resulting child population be CPStep 7 Append CP into PP and let the resulting intermediate population be IPStep 8 Repeat Step 1 and Step 2 for population IPStep 9 Form the population PP for the next generation by picking query plans Front-wise from IP till thepopulation size = 119875sizeStep 10 Increment Generation by 1ENDDOReturn Top-119870 Query Plans from population PP

Algorithm 1 NSGA-II based DQPG algorithm

In order to performanondominated sort each query planis compared with every other query plan in the population tofind if it is dominated For each query plan ldquo119894rdquo the followingtwo entities are considered

(i) 119899119894 The number of query plans that dominate the

query plan 119894(ii) 119878119894 The set of query plans that query plan 119894 dominates

All query plans that have 119899119894= 0 are added to the set

1

Set = 1where is called the current front For each

element 119909 in the current front visit each member 119895 in the set119878119909and reduce the count of 119899

119895by 1 Now if 119899

119895gets reduced to

zero for some 119895 add it to the set 2 After evaluating all the

members of in a similar manner set = 2 This process

continues till all the query plans are assigned some frontThe fast nondominated sorting procedure takes the currentpopulation as input and produces a list of nondominatedfronts

119894as output

Step 4 (Density Estimation Using Crowding Distance [446]) After the nondominated sort the crowding distanceis computed for each query plan in

119894 Crowding distance

[4] is an estimate of the density of solutions surrounding aparticular solution point in the population It is defined asthe average distance of the two closest points on either sidesof the given point along each of the objectives The crowdingdistance 119868(119889119894119904119905119886119899119888119890) is computed in the following manner[4 46]

For each front 119894 let 119873 be the number of query plans

in front 119894 Initially the crowding distance for each query

plan in the front 119894is zero That is 119868(119889

119895) = 0 119895 =

1 119873 Next for each objective function the query plansin the front

119894are sorted based on their value of TPC (ie

the first objective function) and similarly also with respectto TCC (ie the second objective function) and placed in119868(119889119894119904119905119886119899119888119890) The query plans having the smallest and thehighest 119868(119889119894119904119905119886119899119888119890) values in both sets are assigned an infinite

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 5

f(x

2)

f(x1)

minus 1

+ 1

Figure 1 Crowding distance for solution ldquoVrdquo [4]

321 NSGA-II The basis of NSGA-II [4] lies in the non-dominated sorting genetic algorithm (NSGA) introduced bySrinivas and Deb [22] As the name suggests NSGA usesnondominated ranking for each individual in the populationand assigns them accordingly into nondominated frontsTheindividuals in the first front or the nondominated individualsare then assigned large dummy fitness values All individualsin the front shared this fitness value based on a sharing func-tion Next the individuals in the second nondominated frontare considered and similarly assigned a dummy fitness lowerthan the fitness assigned in the previous front This processcontinues till the entire population is classified into frontsSince the solutions in the first front have themaximumfitnessvalue their chances of selection increase and eventually morecopies of such solutions get passed on to the next generationHowever NSGA suffered from some drawbacks such as highcomputational complexity119874(119898119873

3

) nonelitist approach andthe requirement of specifying a shared parameter [4] Theselimitations were addressed in NSGA-II proposed by Deb etal [4] as an improved version of NSGA [22] It alleviatesthe drawbacks in NSGA by reducing the computationalcomplexity to 119874(119898119873

2

) Further it uses a parameter-lesssharing approach by using a crowding distance measurefor selection The crowding distance is an estimate of thedensity of solutions surrounding a particular solution in theobjective space In Figure 1 the crowding distance of solutionrepresented as point V is computed as the average distancebetween the two closest solutions represented as points V minus 1

and V + 1 on either side of the points V along each of theobjectives 119891(1199091) and 119891(1199092)

NSGA-II uses a crowded-comparison operator for selec-tion which takes into account both the nondomination rankof a query plan in the population and its crowding distanceThe nondominated solutions are preferred over dominatedsolutions and between two solutions having the same ranka solution that resides in the less crowded region is preferredthat is a solution for which the crowding distance is higherTheNSGA-II does not use any externalmemory but it ensureselitism by combining the best parents with the best offspringobtained [19] In this paper an NSGA-II based multiobjective

DQPG algorithm is used to compute optimal query plansfor a given distributed query This algorithm is discussednext

33 NSGA-II Based DQPG Algorithm The proposed NSGA-II based DQPG algorithm takes the relations given in theFROM clause of the distributed query as input It arrangesthese relations in increasing order of their cardinalities Itthen generates a fixed set of feasible query plans (chromo-somes) based on the possible combinations of sites in whichthese relations are residing Each gene in a chromosomerepresents a relation and is arranged in increasing orderof the corresponding relationrsquos cardinality The value of agene represents the site in which the corresponding relationresides For example suppose that a query posed by the userhas 4 relations (1198771 1198772 1198773 and 1198774) arranged in ascendingorderThe relation 1198771 is stored in sites 1198781 and 1198783 1198772 is storedin 11987811198773 is stored in 1198781 and 1198782 and1198774 is stored in 1198781Then theinitial population of feasible query plans (chromosomes) canbe (1 1 1 1) (3 1 1 1) (3 1 2 1) and (1 1 2 1) This definesthe encoding scheme for the given problem The proposedDQPG algorithm based on NSGA-II is given in Algorithm 1The steps involved in this algorithm are discussed as follows

Step 1 (Initialize the Population [4 46]) A random popula-tion of query plans is generated as per the encoding schemediscussed above

Step 2 (EvaluateQuery Plans on theObjective Functions)For each of the query plans in the population the TCC andTPC values are computed as given below

TCC =

119894=119906minus1 119895=119906

sum

119894=1 119895=119894+1

CC119894119895times 119887119894

TPC =

119906

sum

119894=1

LPC119894times 119886119894

(9)

where 119906 is the number of sites accessed by the queryplan in ascending order of cardinality per site CC

119894119895is the

communication cost per byte between sites 119894 and 119895 LPC119894

is the local processing cost per byte at site 119894 119887119894is the bytes

to be communicated from site 119894 and 119886119894is the bytes to be

processed at site 119894 The procedure to compute CC119894119895 LPC

119894

119887119894 and 119886

119894is given in Section 21 If a site contains a single

relation its LPC is considered zero TCC and TPC needto be minimized simultaneously to achieve an acceptabletradeoff

Step 3 (Perform Nondominated Sort [4 46]) On the givenpopulation a fast nondominated sorting is performed in thefollowing manner

Twoobjective functions are consideredThefirst objectiveis to minimize the total processing cost (TPC) and thesecond objective is to minimize the total communicationcost (TCC) NSGA-II attempts to find a tradeoff betweenthese two objectives that can result in minimum total queryprocessing cost (TC)

6 The Scientific World Journal

Input 119877119894 Relations participating in the query 119875

119888 Probability of crossover

119875119898 Probability of mutation 119866 Pre-defined number of generations 119875size Population Size

Output Top-n query plansMethodInitialize a random parent population of query plans PPwhere chromosome length ldquolenrdquo is the number of relations accessed in the query and the gene at the 119896th position in thechromosome represents the site of the 119896th relationWHILE generation le 119866 DOStep 1 Evaluate each query plan in PP on the following objective functions

1198981 Minimize TCC =

119894=119906minus1119895=119906

sum

119894=1119895=119894+1

CC119894119895times 119887119894

1198982 Minimize TPC =

119906

sum

119894=1

LPC119894times 119886119894

where 119906 is the number of sites accessed by the query plan in ascending order of cardinality per site CC119894119895is the communication

cost per byte between sites 119894 and 119895 LPC119894is the local processing cost per byte at site 119894 119887

119894is the bytes to be communicated from

site 119894 and 119886119894is the bytes to be processed at site 119894

Step 2 Perform Non-Dominated (ND) Sort on PP for ldquo1198981rdquo and ldquo119898

2rdquo separately and place each query plan (QP)

into corresponding ND fronts ldquo119894rdquo and sort the QPs within each ldquo

119894rdquo

Step 3 Evaluate Crowding Distance Function 119868(119889) for each objective functionAssign 119868(119889) = infin for smallest and highest values in each front ldquo

119894rdquo

For the remaining QPs 119868(119889) is calculated as

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

where 119868(119896)119898is the value of119898th objective function of 119896th query plan in Front

119894and 119891

max119898

and 119891

min119898

are themaximum and minimum values obtained for the objective function119898Step 4 Perform Selection from PP using binary tournament selection using crowded comparison operator (≺

119899)

Step 5 Perform random single point crossover on selected chromosomes with crossover probability 119875119888

Step 6 Apply mutation on resulting population with mutation probability 119875119898

Let the resulting child population be CPStep 7 Append CP into PP and let the resulting intermediate population be IPStep 8 Repeat Step 1 and Step 2 for population IPStep 9 Form the population PP for the next generation by picking query plans Front-wise from IP till thepopulation size = 119875sizeStep 10 Increment Generation by 1ENDDOReturn Top-119870 Query Plans from population PP

Algorithm 1 NSGA-II based DQPG algorithm

In order to performanondominated sort each query planis compared with every other query plan in the population tofind if it is dominated For each query plan ldquo119894rdquo the followingtwo entities are considered

(i) 119899119894 The number of query plans that dominate the

query plan 119894(ii) 119878119894 The set of query plans that query plan 119894 dominates

All query plans that have 119899119894= 0 are added to the set

1

Set = 1where is called the current front For each

element 119909 in the current front visit each member 119895 in the set119878119909and reduce the count of 119899

119895by 1 Now if 119899

119895gets reduced to

zero for some 119895 add it to the set 2 After evaluating all the

members of in a similar manner set = 2 This process

continues till all the query plans are assigned some frontThe fast nondominated sorting procedure takes the currentpopulation as input and produces a list of nondominatedfronts

119894as output

Step 4 (Density Estimation Using Crowding Distance [446]) After the nondominated sort the crowding distanceis computed for each query plan in

119894 Crowding distance

[4] is an estimate of the density of solutions surrounding aparticular solution point in the population It is defined asthe average distance of the two closest points on either sidesof the given point along each of the objectives The crowdingdistance 119868(119889119894119904119905119886119899119888119890) is computed in the following manner[4 46]

For each front 119894 let 119873 be the number of query plans

in front 119894 Initially the crowding distance for each query

plan in the front 119894is zero That is 119868(119889

119895) = 0 119895 =

1 119873 Next for each objective function the query plansin the front

119894are sorted based on their value of TPC (ie

the first objective function) and similarly also with respectto TCC (ie the second objective function) and placed in119868(119889119894119904119905119886119899119888119890) The query plans having the smallest and thehighest 119868(119889119894119904119905119886119899119888119890) values in both sets are assigned an infinite

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

6 The Scientific World Journal

Input 119877119894 Relations participating in the query 119875

119888 Probability of crossover

119875119898 Probability of mutation 119866 Pre-defined number of generations 119875size Population Size

Output Top-n query plansMethodInitialize a random parent population of query plans PPwhere chromosome length ldquolenrdquo is the number of relations accessed in the query and the gene at the 119896th position in thechromosome represents the site of the 119896th relationWHILE generation le 119866 DOStep 1 Evaluate each query plan in PP on the following objective functions

1198981 Minimize TCC =

119894=119906minus1119895=119906

sum

119894=1119895=119894+1

CC119894119895times 119887119894

1198982 Minimize TPC =

119906

sum

119894=1

LPC119894times 119886119894

where 119906 is the number of sites accessed by the query plan in ascending order of cardinality per site CC119894119895is the communication

cost per byte between sites 119894 and 119895 LPC119894is the local processing cost per byte at site 119894 119887

119894is the bytes to be communicated from

site 119894 and 119886119894is the bytes to be processed at site 119894

Step 2 Perform Non-Dominated (ND) Sort on PP for ldquo1198981rdquo and ldquo119898

2rdquo separately and place each query plan (QP)

into corresponding ND fronts ldquo119894rdquo and sort the QPs within each ldquo

119894rdquo

Step 3 Evaluate Crowding Distance Function 119868(119889) for each objective functionAssign 119868(119889) = infin for smallest and highest values in each front ldquo

119894rdquo

For the remaining QPs 119868(119889) is calculated as

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

where 119868(119896)119898is the value of119898th objective function of 119896th query plan in Front

119894and 119891

max119898

and 119891

min119898

are themaximum and minimum values obtained for the objective function119898Step 4 Perform Selection from PP using binary tournament selection using crowded comparison operator (≺

119899)

Step 5 Perform random single point crossover on selected chromosomes with crossover probability 119875119888

Step 6 Apply mutation on resulting population with mutation probability 119875119898

Let the resulting child population be CPStep 7 Append CP into PP and let the resulting intermediate population be IPStep 8 Repeat Step 1 and Step 2 for population IPStep 9 Form the population PP for the next generation by picking query plans Front-wise from IP till thepopulation size = 119875sizeStep 10 Increment Generation by 1ENDDOReturn Top-119870 Query Plans from population PP

Algorithm 1 NSGA-II based DQPG algorithm

In order to performanondominated sort each query planis compared with every other query plan in the population tofind if it is dominated For each query plan ldquo119894rdquo the followingtwo entities are considered

(i) 119899119894 The number of query plans that dominate the

query plan 119894(ii) 119878119894 The set of query plans that query plan 119894 dominates

All query plans that have 119899119894= 0 are added to the set

1

Set = 1where is called the current front For each

element 119909 in the current front visit each member 119895 in the set119878119909and reduce the count of 119899

119895by 1 Now if 119899

119895gets reduced to

zero for some 119895 add it to the set 2 After evaluating all the

members of in a similar manner set = 2 This process

continues till all the query plans are assigned some frontThe fast nondominated sorting procedure takes the currentpopulation as input and produces a list of nondominatedfronts

119894as output

Step 4 (Density Estimation Using Crowding Distance [446]) After the nondominated sort the crowding distanceis computed for each query plan in

119894 Crowding distance

[4] is an estimate of the density of solutions surrounding aparticular solution point in the population It is defined asthe average distance of the two closest points on either sidesof the given point along each of the objectives The crowdingdistance 119868(119889119894119904119905119886119899119888119890) is computed in the following manner[4 46]

For each front 119894 let 119873 be the number of query plans

in front 119894 Initially the crowding distance for each query

plan in the front 119894is zero That is 119868(119889

119895) = 0 119895 =

1 119873 Next for each objective function the query plansin the front

119894are sorted based on their value of TPC (ie

the first objective function) and similarly also with respectto TCC (ie the second objective function) and placed in119868(119889119894119904119905119886119899119888119890) The query plans having the smallest and thehighest 119868(119889119894119904119905119886119899119888119890) values in both sets are assigned an infinite

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 7

Rejected

Crowding distance sort PP

CP

IP

PP

Nondominatedsort

Ŧ1

Ŧ2

Ŧ3

Figure 2 NSGA-II procedure preserving elitism [4]

value for 119868(119889119894119904119905119886119899119888119890) that is 119868(1198891) = infin and 119868(119889

119873) = infin For

remaining query plans that is 119896 = 2 119873 minus 1 119868(119889119894119904119905119886119899119888119890)is computed as follows [4 46]

119868 (119889119896) = 119868 (119889

119896) +

119868(119896 + 1)119898minus 119868(119896 minus 1)

119898

119891

max119898

minus 119891

min119898

(10)

where 119868(119896)119898is the value of 119898th objective function of 119896th

query plan in front 119894and 119891

max119898

and 119891

min119898

are the maximumand minimum values obtained for the objective function119898

Step 5 (Binary Tournament Selection) After assigning thecrowding distance to the query plans in each front a selectionprocess is carried outThe selection scheme used is the binarytournament selection and it is carried out using the crowdedcomparison operator (≺

119899) [4 46] It uses two parameters as

given below(i) 120588rank (nondomination rank)The query plans in front

119894will have 120588rank = 119894

(ii) 119868119894(119889119895) (crowding distance in front 119894) 119895 = 1 119873

The crowded comparison is performed as described below [446]

For any two query plans 1199021199011and 119902119901

2 1199021199011is selected if

(1199021199011≺1198991199021199012) 1199021199011≺1198991199021199012is true if either one of the following

holds(i) 120588rank(1199021199011) lt 120588rank(1199021199012)(ii) if 119902119901

1and 119902119901

2belong to the same front (ie if

120588rank(1199021199011) = 120588rank(1199021199012)) then 119868119894(1198891199021199011

) gt 119868119894(1198891199021199012

)

Step 6 (Crossover andMutation) Crossover is performed onthe selected query plans with a given crossover probability119875119888 It ensures proper exploration of the search space by

combining the best features of the parent query plans (chro-mosomes) Mutation is performed on the given populationwith a given probability 119875

119898 It randomly changes the site

(gene) in which the corresponding relation resides within aquery plan (chromosome) The mutated gene always takes arandom value from a set of valid sites for a particular relationAfter going through the above steps the first generationpopulation is formed NSGA-II follows a different methodto produce subsequent generations in order to incorporateelitism as described next

Table 1 Initial parent population PP

[2 1 5 1] [3 5 4 5]

[3 3 1 1] [3 1 1 1]

[3 3 3 3] [3 4 4 4]

[2 4 1 5] [2 3 3 3]

[3 1 1 3] [2 1 3 4]

Step 7 (Preserving Good Solutions (Elitism) [4]) In subse-quent generations the new population after each generationis combined with the parent population and a new interme-diate population IP is created of size PP + CP where PP is theparent population and CP is the child population as shown inFigure 2

The non-dominated sort is applied to this intermediatepopulation and fronts are formed as described in Step 3Finally the population for the next generation is formedby adding solutions from each front till the population sizeexceeds 119875 If the last front to be included was

119888 which led

to the population overflow then query plans in Front 119888are

selected based on their crowding distance measure (Step 4)in descending order until the population size exceeds 119875

The above steps are repeated for ldquo119866rdquo generations and theTop-119870 query plans are produced as output

An example illustrating the use of the above NSGA-IIbased DQPG algorithm to generate query plans for a givendistributed query is given next

4 An Example

Consider the site relation matrix the communication-cost matrix the local processing cost matrix the distinct-tuple matrix and the size matrix used to compute thefitness of query plans given in [3] and shown below inFigure 3 Suppose the initial parent population PP com-prises of 10 query plans given in Table 1 Consider aquery that accesses four relations (1198771 1198772 1198773 and 1198774)which are distributed among five sites (1198781 1198782 1198783 1198784 and1198785)

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

8 The Scientific World Journal

R1 R2 R3 R4300200 500 700

Cardinality

R1 R2 R3 R430 40 50 20

Size

S1 S2 S3 S4 S52 2 3 3 2

S1 S2 S3 S4 S5

S1 150 180 150 160

S2 165 185 170 170

S3 160 160 170 150

S4 150 150 180 160

S5 170 160 190 150

Communication costmatrix

R1 R2 R3 R4

R1 04 05 04 04

R2 05 02 02 03

R3 04 02 02 01

R4 04 03 01 03

matrix for join

R1 R2 R3 R4S1 0 1 1 1S2 1 0 0 0S3 1 1 1 1S4 0 1 1 1S5 0 0 1 1

matrixLPCi

Distinct-tuple- Relation-site

mdash

mdashmdash

mdash

mdash

Figure 3 Matrices used for computing fitness [3]

Table 2 Nondominated sort on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC 119878

119894119899119894

Front[1] [2 1 5 1] 18020000 3746700 [4 10] 1

2

[2] [3 3 1 1] 6720000 6006000 [4 10] 1 2

[3] [3 3 3 3] 0 4200000 [2 4 7 8 9 10] mdash 1

[4] [2 4 1 5] 43440000 70793333 [10] 6 3

[5] [3 1 1 3] 14000000 2112000 [1 4 6 10] mdash 1

[6] [3 5 4 5] 17020000 3847000 [4 10] 1 2

[7] [3 1 1 1] 960000 6500000 [4 8 9 10] 1 2

[8] [3 4 4 4] 1020000 9750000 [10] 2 3

[9] [2 3 3 3] 1110000 9750000 [10] 2 3

[10] [2 1 3 4] 45090000 10514000 mdash 9 4

Table 3 Sorted fronts based on TCC and TPC

Query plans sorted on1198981= TCC

Query plans sorted on1198982= TPC

1

3 5 1

5 32

7 2 6 1 2

1 6 2 73

8 9 4 3

4 8 94

10 4

10

The computations of TPC and TCC for the query plan[2 4 1 5] are given as follows

TPC4= (LPC

2times 1198862) + (LPC

4times 1198864)

+ (LPC5times 1198865) + (LPC

1times 1198861)

TCC4= (CC

24times 11988724

) + (CC45

times 11988745

) + (CC51

times 11988751

)

(11)

whereLPC2times 1198862= 0

CC24

times 11988724

= CC24

times (Card (1198771) times Size (1198771))

= 170 times (200 times 30) = 1020000

LPC4times 1198864= LPC

4times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)) )

= 2 times (

200 times 300

05 times 200

times (70)) = 126000

CC45

times 11988745

= CC45

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times (Size (1198771) + Size (1198772)))

= 170 times (

200 times 300

05 times 200

times (70)) = 6720000

LPC5times 1198865

= LPC5

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 420000

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 9

Table 4 Results for the objective functions on PP

Index [119894] Population PP 1198981= TCC 119898

2= TPC Front Crowding distance (CD) = 119868[119894]

[1] [2 1 5 1] 18020000 3746700 2

infin

[2] [3 3 1 1] 6720000 6006000 2

0672[3] [3 3 3 3] 0 4200000

1infin

[4] [2 4 1 5] 43440000 70793333 3

infin

[5] [3 1 1 3] 14000000 2112000 1

infin

[6] [3 5 4 5] 17020000 3847000 2

05195[7] [3 1 1 1] 960000 6500000

2infin

[8] [3 4 4 4] 1020000 9750000 3

infin

[9] [2 3 3 3] 1110000 9750000 3

infin

[10] [2 1 3 4] 45090000 10514000 4

infin

Table 5 Selection of query plans using binary tournament selection technique

IndexRandomly generated

indexes[119894] and [119895]

Tournament betweenquery plans

[119875(119894)] and [119875(119895)]

Front comparison Query plan selected (lowerfront or higher CD)

[1] [1] and [4] [2 1 5 1] and [2 4 1 5] [2] and [

3] [2 1 5 1]

[2] [2] and [3] [3 3 1 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[3] [3] and [4] [3 3 3 3] and [2 4 1 5] [1] and [

3] [3 3 3 3]

[4] [2] and [6] [3 3 1 1] and [3 5 4 5] [2] and [

2] [3 3 1 1]

[5] [2] and [4] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

[6] [1] and [3] [2 1 5 1] and [3 3 3 3] [2] and [

1] [3 3 3 3]

[7] [2] and [5] [3 3 1 1] and [3 1 1 3] [2] and [

1] [3 1 1 3]

[8] [2] and [8] [3 3 1 1] and [3 1 1 3] [2] and [

3] [3 3 1 1]

[9] [7] and [8] [3 1 1 1] and [3 1 1 3] [2] and [

3] [3 1 1 1]

[10] [2] and [9] [3 3 1 1] and [2 4 1 5] [2] and [

3] [3 3 1 1]

Table 6 Population CP after crossover and mutation

Index Population after crossover Population aftermutation (CP)

[1] [2 1 1 1] [2 1 1 1]

[2] [3 3 1 1] [3 3 1 1]

[3] [3 3 5 1] [3 3 4 1][4] [3 3 1 1] [3 3 1 1]

[5] [3 3 3 1] [3 3 3 1]

[6] [3 3 3 3] [3 3 3 3]

[7] [3 1 1 3] [3 1 1 3]

[8] [3 3 1 3] [3 3 1 3]

[9] [3 1 3 3] [3 3 3 3]

[10] [3 3 1 1] [3 3 1 1]

CC51

times 11988751

= CC51

times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times (Size (1198771) + Size (1198772) + Size (1198774)) )

170 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

)

times 90 = 35700000

LPC1times 1198861

= LPC1times (

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

times Card (1198774)

times (Dist24

times

Card (1198771) times Card (1198772)

Dist12

times Card (1198771)

)

minus1

times

Card (1198773)

Dist43

times Card (1198773)

times (

4

sum

119894=1

Size (119877119894)))

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

10 The Scientific World Journal

Table 7 Front assignment for query plans in IP

Index Intermediate population IP TCC TPC Front[1] [2 1 5 1] 18020000 3747000

2

[2] [3 3 1 1] 6720000 6006000 3

[3] [3 3 3 3] 0 4200000 1

[4] [2 4 1 5] 43440000 7079000 4

[5] [3 1 1 3] 14000000 2112000 1

[6] [3 5 4 5] 17020000 3847000 2

[7] [3 1 1 1] 960000 6500000 2

[8] [3 1 1 3] 1020000 9750000 3

[9] [2 3 3 3] 1110000 9750000 3

[10] [2 1 3 4] 45090000 10514000 5

[11] [2 1 1 1] 990000 6500000 2

[12] [3 3 1 1] 6720000 6006000 3

[13] [3 3 4 1] 90300000 8946000 5

[14] [3 3 1 1] 6720000 6006000 3

[15] [3 3 3 1] 2520000 4230000 2

[16] [3 3 3 3] 0 4200000 1

[17] [3 1 1 3] 14000000 2112000 1

[18] [3 3 1 3] 4500000 2640000 1

[19] [3 3 3 3] 0 4200000 1

[20] [3 3 1 1] 6720000 6006000 3

Table 8 Nondominated sorting for query plans in IP

Index Intermediatepopulation IP Front Crowding

distance[3] [3 3 3 3]

1

[5] [3 1 1 3] 1

[16] [3 3 3 3] 1

[17] [3 1 1 3] 1

[18] [3 3 1 3] 1

[19] [3 3 3 3] 1

[1] [2 1 5 1] 2 infin

[7] [3 1 1 1] 2 infin

[11] [2 1 1 1] 2 infin

[15] [3 3 3 1] 2 04933

[6] [3 5 4 5] 2 02292

[2] [3 3 1 1] 3

[9] [2 3 3 3] 3

[8] [3 1 1 3] 3

[12] [3 3 1 1] 3

[14] [3 3 1 1] 3

[20] [3 3 1 1] 3

[4] [2 4 1 5] 4

[10] [2 1 3 4] 5

[13] [3 3 4 1] 4

= 2 times (

200 times 300

05 times 200

times

700

03 times (200 times 300) (05 times 200)

times

500

01 times 500

) times 140 = 65333333

(12)

Table 9 Population PP in the 2nd generation

Index Population PP[1] [3 3 3 3]

[2] [3 1 1 3]

[3] [3 3 3 3]

[4] [3 1 1 3]

[5] [3 3 1 3]

[6] [3 3 3 3]

[7] [2 1 5 1]

[8] [3 1 1 1]

[9] [2 1 1 1]

[10] [3 3 3 1]

So

TPC4= 126000 + 420000 + 65333333 = 70793333

TCC4= 1020000 + 6720000 + 35700000 = 43440000

(13)

Similarly TCC and TPC values of the other nine queryplans are computed Consider TCC and TPC of the 10 queryplans are given in Table 2 The population is then sorted intodifferent nondominated fronts as described in Step 3 of theproposed algorithm For example for query plan 1 that is[2 1 5 1] the set 119878

1= number of query plans that dominate

query plan 1 Since TCC [1] lt TCC [4] TPC [1] lt TPC [4]TCC [1] lt TCC [10] and TPC [1] lt TPC [10] the elementsof 1198781

= 4 10 Similarly the sets 1198782 119878

10are computed

and are given in Table 2 119899119894stores the count of query plans

that dominate 119894 So using the values in 119878119894 1198991

= 1 as onlyquery plan 5 is dominating 1 and 119899

2= 1 as only query plan 3

is dominating 2 Similarly 1198992 119899

10are computed and are

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 11

given in Table 2 From Table 2 it can be noted that queryplans 3 and 5 are not dominated as 119899

3= 0 119899

5= 0 So they

are assigned to the first nondominated front1The elementsin the next front are computed by reducing the count in 119899

119894

for each 119894 isin 1198783and 119894 isin 119878

5 So 119899

119894= 0 0 minus 4 minus 0 0 1 1 7

So the second front has query plans 1 2 6 and 7 Thisprocess continues till all the query plans in the population areassigned to their respective nondominated fronts The fronts1 2 3 and

4are formed and are given in Table 2

Finally the query plans are sorted separately on the valuesof TCC and TPC within each front as shown in Table 3

After the population is sorted into different fronts thecrowding distance (119868[119889119894119904119905119886119899119888119890]) computation is performedfor each query plan using the formula given in Step 4 ofthe proposed algorithm Query plans having the maximumand minimum values in each front are assigned infin distancevalues that is 119868[3] 119868[5] 119868[7] 119868[1] 119868[8] 119868[4] 119868[9] and119868[10] are assigned infin For the rest of the query plans theircrowding distance (CD) values are computed based on thesorted order of query plans in each front Initially they areassigned 119868[119889119894119904119905119886119899119888119890] = 0 For query plan 2 in front

2 the

CD computation is performed as follows

1198681198981[2] = 119868 [2] +

TCC [6] minus TCC [7]

max (TCC) minusmin (TCC)

= 0 +

17020000 minus 960000

45090000 minus 0

= 03562

119868 [2] = 1198681198981[2] +

TPC [7] minus TPC [6]

max (TPC) minusmin (TPC)

= 03562 +

6500000 minus 3847000

10514000 minus 2112000

= 06720

(14)

CD of the query plans in the given population is given inTable 4

Next binary tournament selection is performed on thepopulation on the basis of crowded comparison operator ≺

119899

This selection process is shown in Table 5The selected query plans undergo random single point

crossover with crossover probability 119875119888

= 05 Mutation isperformed on the selected population with mutation proba-bility 119875

119898= 002 The child population CP after crossover and

mutation is shown in Table 6Now in accordance with NSGA-II algorithm the popu-

lations from the second generation onward have to ensureelitism For this purpose the child population CP is com-bined with the parent population PP to generate intermediatepopulation IP This population is subjected to nondominatedsort and fronts are formed as given in Table 7

The population for the second generation is arrived atby selecting query plans based on front and within it basedon crowding distance-wise as described in Step 7 from theintermediate population IP till the actual population size 10is exceeded This selection is shown in Table 8

The population PP for the second generation is given inTable 9

The above process is repeated till a predefined number ofgenerations119866have elapsedThereafter Top-119870 query plans areproduced as output

0

01

02

03

04

05

06

07

08

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-5 query plans)

Figure 4

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

DQPGNSGA(top-10 query plans)

Figure 5

5 Experimental Results

The proposed NSGA-II based algorithm is implemented inMATLAB 77 in Windows 7 professional 64 bit OS with Intelcore i3CPUat 213GHzhaving 4GBRAMExperimentswerecarried out for a population of 100 query plans with eachquery plan involving 10 relations distributed over 50 sitesThese were performed on four datasets each comprising adifferent relation-site matrix Graphs were plotted to observechange in average TC (ATC) with respect to generations

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

12 The Scientific World Journal

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-15 query plans)

Figure 6

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

0

01

02

03

04

05

06

07

08

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGNSGA(top-20 query plans)

Figure 7

and Top-119870 query plans for different pairs of crossover andmutation rates

First graphs showing the ATC values Top-5 Top-10Top-15 and Top-20 query plans averaged over four datasetsover 1000 generations for distinct pairs of crossover andmutation probability 119875

119888 119875119898 = 07 001 07 0005

075 001 075 0005 08 001 08 0005 085 001085 0005 using NSGA-II based DQPG algorithm(DQPGNSGA) were plotted These are shown in Figures 45 6 and 7 It can be observed from these graphs that the

0

005

01

015

02

025

03

035

04

045

05

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(after 1000 generations)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGNSGA

Top-K query plans

Figure 8

0

01

02

03

04

05

06

07

08

09

1

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-5 query plans)

Figure 9

convergence to ATC is lowest in the case of 119875119888 119875119898 =

085 001 Furthermore the graph showing ATC valuesaveraged over four datasets versus Top-119870 query plans after1000 generations (Figure 8) also shows that the lowest ATCvalues achieved are for 119875

119888 119875119898 = 085 001 Thus it

can be said that DQPGNSGA performs reasonably well for119875119888 119875119898 = 085 001 In order to compare DQPGNSGA with

SGA based DQPG algorithm (DQPGSGA) similar graphswere plotted for DQPGSGA These are shown in Figures 9 1011 12 and 13 It is noted from these graphs that DQPGSGA

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 13

0

02

04

06

08

1

12

14

1618

21 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-10 query plans)

Figure 10

0

02

04

06

08

1

12

14

16

18

2

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Generation07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

Aver

age t

otal

cost

(ATC

)

DQPGSGA(top-15 query plans)

Figure 11

also achieves convergence to the lowest ATC values for119875119888 119875119898 = 085 001

Since the two algorithms DQPGNSGA and DQPGSGAconverge to a lower ATC value for the same crossover andmutation probabilities that is 119875

119888 119875119898 = 085 001 the

comparisons of the two algorithms can be carried out forthese observed probabilities

First the two algorithmsDQPGNSGA andDQPGSGA werecompared for each of the four datasets (Dataset-1 Dataset-2 Dataset-3 and Dataset-4) on the ATC values of Top-5Top-10 Top-15 and Top-20 query plans generated over 1000

0

04

08

12

16

2

24

28

32

36

4

1 54 107

160

213

266

319

372

425

478

531

584

637

690

743

796

849

902

955

Generation

Aver

age t

otal

cost

(ATC

)

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

DQPGSGA(top-20 query plans)

Figure 12

0

01

02

03

04

05

06

07

08

09

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

07 00107 0005075 001075 0005

08 00108 0005085 001085 0005

(after 1000 generations)

Aver

age t

otal

cost

(ATC

)

DQPGSGA

Top-K query plans

Figure 13

generations for 119875119888 119875119898 = 085 001 The corresponding

graphs for each dataset are shown in Figures 14 15 16 and 17It can be observed from the graphs thatDQPGNSGA convergesto lower ATC values for Top-5 Top-10 Top-15 and Top-20query plans generated for all datasets

Furthermore graphs comparing the ATC values of Top-119870 query plans produced by DQPGNSGA and DQPGSGA after1000 generations for the four datasets were plotted andare shown in Figures 18 19 20 and 21 These graphs alsoshow average TCC (ATCC) and average TPC (ATPC) valuesof Top-119870 query plans generated by DQPGNSGA It can beobserved from these graphs thatDQPGNSGA is able to achieve

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

14 The Scientific World Journal

0

05

1

15

2

25

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-1)

Figure 14

0

03

06

09

12

15

18

21

24

GenerationSGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-2)

Figure 15

an acceptable tradeoff between ATPC and ATCC which inturn leads to a comparatively lower ATC for the Top-119870 queryplans generated by it

Next a graph comparing the ATC values of Top-119870 queryplans generated by DQPGNSGA and DQPGSGA on all fourdatasets (DS-1 DS-2 DS-3 and DS-4) after 1000 generationsfor observed probabilities 119875

119888 119875119898 = 085 001were plotted

and is shown in Figure 22 It is noted from the graph thatDQPGNSGA performs better than DQPGSGA on the ATCvalues of Top-119870 query plans generated by the two algorithmsfor each of the four data sets

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

0

015

03

045

06

075

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-3)

Figure 16

SGA-top 5SGA-top 10SGA-top 15SGA-top 20

NSGA-top 5NSGA-top 10NSGA-top 15NSGA-top 20

Generation

1 38 75 112

149

186

223

260

297

334

371

408

445

482

519

556

593

630

667

704

741

778

815

852

889

926

963

1000

0

01

02

03

04

05

06

07

08

09

1

Aver

age t

otal

cost

(ATC

)

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(Dataset-4)

Figure 17

It can be reasonably inferred from all the above graphsthat DQPGNSGA is able to generate Top-119870 query plans withlower ATC when compared to those generated byDQPGSGAThis may be attributed to acceptable tradeoffs achieved whilesimultaneously optimizing TPC and TCC which results inlower TC in case of DQPGNSGA

6 Conclusion

In this paper DQPGproblemgiven in [3] has been addressedwhere query plans are generated for a distributed relational

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 15: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 15

0

005

01

015

02

025

03

035

04

045

1 3 5 7 9 11 13 15 17 19

Cos

t

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-1)

Figure 18

0

01

02

03

04

05

06

07

Cos

t

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

Top-K query plans

(Dataset-2)

Figure 19

query that incurs minimum total query processing costGenetic algorithms have been used to generate these queryplansThe total query processing cost TC in [3] can be viewedas comprising broadly of TPC and TCC and thereforeminimizing TPC and TCC would result in minimizing TCThus in this paper the single-objective DQPG problemin [3] has been formulated and solved as a biobjectiveDQPG problem with the two objectives being minimizingTPC and minimizing TCC These objectives are minimizedsimultaneously using the multiobjective genetic algorithmNSGA-II

Experiments were performed and DQPGNSGA is com-pared with DQPGSGA given in [3] It was observed thatboth the algorithms individually gave good results for thecrossover and mutation probabilities 085 and 001 respec-tively The two algorithms were then compared on the ATC

0

002

004

006

008

01

012

014

016

Cos

t

1 3 5 7 9 11 13 15 17 19

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-3)

Figure 20

0

005

01

015

02

025

Cos

t

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

ATC-SGAATCC-NSGA-II

ATPC-NSGA-IIATC-NSGA-II

DQPGNSGA versus DQPGSGAPc Pm = 085 001 (after 1000 generations)

Top-K query plans

(Dataset-4)

Figure 21

values of the Top-119870 query plans generated by them for theobserved crossover and mutation probabilities The resultsshowed that DQPGNSGA performed better than DQPGSGAAlso the performance of the former was better when the twoalgorithmswere compared on theATCvalues of Top-119870 queryplansThe better performance of DQPGNSGA over DQPGSGAmay be attributed to DQPGNSGA achieving acceptable trade-offs between TPC and TCC while minimizing TPC and TCCof Top-119870 query plans simultaneously

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 16: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

16 The Scientific World Journal

Top-K query plans

0

01

02

03

04

05

06

07

Aver

age t

otal

cost

(ATC

)

SGA(DS-1)SGA(DS-2)SGA(DS-3)SGA(DS-4)

NSGA(DS-1)NSGA(DS-2)NSGA(DS-3)NSGA(DS-4)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

DQPGNSGA versus DQPGSGAPc Pm = 085 001

(after 1000 generations)

Figure 22

References

[1] S Ceri and G Pelgatti Distributed Databases Principles ampSystems International Edition McGraw-Hill Computer ScienceSeries 1985

[2] A R Hevner and S B Yao ldquoQuery processing in distributeddatabase systemsrdquo IEEE Transactions on Software Engineeringvol 5 no 3 pp 177ndash187 1979

[3] T V Vijay Kumar and S Panicker ldquoGenerating query plansfor distributed query processing using genetic algorithmrdquo inInformation Computing and Applications vol 7030 of LectureNotes in Computer Science pp 765ndash772 2011

[4] K Deb A Pratap S Agarwal and T Meyarivan ldquoA fastand elitist multiobjective genetic algorithm NSGA-IIrdquo IEEETransactions on Evolutionary Computation vol 6 no 2 pp 182ndash197 2002

[5] P Black and W Luk ldquoA new heuristic for generating semi-joinprograms for distributed query processingrdquo in Proceedings ofthe IEEE 6th International Computer Software and ApplicationConference pp 581ndash588 IEEE Chicago Ill USA November1982

[6] P Bodorik and J S Riordon ldquoA threshold mechanism fordistributed query processingrdquo in Proceedings of the 16th AnnualConference on Computer Science pp 616ndash625 Atlanta GaUnited States February 1988

[7] J Chang ldquoA heuristic approach to distributed query process-ingrdquo in Proceedings of the 8th International Conference on VeryLarge Data Bases (VLDB rsquo82) pp 54ndash61 Saratoga Calif USA1982

[8] D Kossmann ldquoThe state of the art in distributed queryprocessingrdquo ACM Computing Surveys vol 32 no 4 pp 422ndash469 2000

[9] L Stiphane and W Eugene ldquoA state transition model fordistributed query processingrdquo ACM Transactions on DatabaseSystems vol 11 no 3 pp 249ndash322 1986

[10] T V Vijay Kumar V Singh and A K Verma ldquoDistributedquery processing plans generation using genetic algorithmrdquo

International Journal of Computer Theory and Engineering vol3 no 1 pp 38ndash45 2011

[11] C T Yu and C C Chang ldquoDistributed query processingrdquoACMComputing Surveys vol 16 no 4 pp 399ndash433 1984

[12] K Bennett M C Ferris and Y E Ioannidis ldquoA Geneticalgorithm for database query optimizationrdquo in Proceedings ofthe 4th International Conference onGenetic Algorithms pp 400ndash407 1991

[13] Y E Ioannidis and Y Cha Kang ldquoRandomized algorithms foroptimizing large join queriesrdquo in Proceedings of the 1990 ACMSIGMOD International Conference on Management of Data pp312ndash321 May 1990

[14] Y Ioannidis and E Wong ldquoQuery optimization by simulatedannealingrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data (SIGMOD rsquo87) pp 9ndash221987

[15] A A R Townsend Genetic Algorithms A Tutorial 2003[16] M Gregory Genetic Algorithm Optimization of Distributed

Database Queries IEEE 1998[17] M Mitchell An Introduction to Genetic Algorithms The MIT

Press Cambridge Mass USA 1999[18] J D Schaffer ldquoMultiple objective optimization with vector

evaluated genetic algorithmsrdquo inProceedings of the InternationalConference on Genetic Algorithm andTheir Applications pp 93ndash100 July 1985

[19] V Guliashki H Toshev andC Korsemov ldquoSurvey of evolution-ary algorithms used in multiobjective optimizationrdquo Problemsof EngineeringCybernetics andRobotics vol 60 pp 42ndash54 2009

[20] C M Fonseca and P J Fleming ldquoGenetic algorithms formultiobjective optimization formulation discussion and gen-eralizationrdquo inProceedings of the 5th International Conference onGenetic Algorithms S Forrest Ed pp 416ndash423 San FranciscoCalif USA 1993

[21] J Horn N Nafpliotis and D E Goldberg ldquoA niched Paretogenetic algorithm for multiobjective optimizationrdquo in Proceed-ings of the 1st IEEE Conference on Evolutionary ComputationIEEE World Congress on Computational Intelligence pp 82ndash87IEEE Orlando Fla USA June 1994

[22] N Srinivas and K Deb ldquoMultiobjective optimization usingnondominated sorting in genetic algorithmsrdquo EvolutionaryComputation vol 2 no 3 pp 221ndash248 1994

[23] E Zitzler and L Thiele ldquoMultiobjective evolutionary algo-rithms a comparative case study and the strength Paretoapproachrdquo IEEETransactions onEvolutionaryComputation vol3 no 4 pp 257ndash271 1999

[24] J D Knowles and D W Corne ldquoApproximating the non-dominated front using the Pareto archived evolution strategyrdquoEvolutionary computation vol 8 no 2 pp 149ndash172 2000

[25] D W Corne J D Knowles and M J Oates ldquoThe Paretoenvelope-based selection algorithm for multiobjective opti-mizationrdquo in Proceedings of the 6th International Conference onParallel Problem Solving from Nature Springer Paris FranceSeptember 2000

[26] J Scharnow K Tinnefeld and I Wegener ldquoThe analysis of evo-lutionary algorithms on sorting and shortest paths problemsrdquoJournal of Mathematical Modelling and Algorithms vol 3 no 4pp 349ndash366 2004

[27] B Doerr E Happ and C Klein ldquoCrossover can provablybe useful in evolutionary computationrdquo in Proceedings of the10th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo08) pp 539ndash546 ACM Press July 2008

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 17: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

The Scientific World Journal 17

[28] C Horoba ldquoAnalysis of a simple evolutionary algorithm for themultiobjective shortest path problemrdquo in Proceedings of the 10thACM SIGEVO Workshop on Foundations of Genetic Algorithms(FOGA rsquo09) pp 113ndash120 ACM Press New York NY USAJanuary 2009

[29] MTheile ldquoExact solutions to the traveling salesperson problemby a population-based evolutionary algorithmrdquo in Proceedingsof the 9th European Conference on Evolutionary Computation inCombinatorial Optimisation (EvoCOP rsquo09) vol 5482 of LectureNotes in Computer Science pp 145ndash155 Springer April 2009

[30] A V Eremeev On Linking Dynamic Programming and Multi-Objective Evolutionary Algorithms Omsk StateUniversity 2008

[31] A V Eremeev ldquoA fully polynomial randomized approximationscheme based on an evolutionary algorithmrdquo Diskretnyi Analizi Issledovanie Operatsii vol 17 no 4 pp 3ndash17 2010

[32] T Friedrich N Hebbinghaus F Neumann J He and CWitt ldquoApproximating covering problems by randomized searchheuristics using multi-objective modelsrdquo in Proceedings of the9th Annual Genetic and Evolutionary Computation Conference(GECCO rsquo07) pp 797ndash804 ACM Press July 2007

[33] E Elbeltagi T Hegazy and D Grierson ldquoA modified shuffledfrog-leaping optimization algorithm applications to projectmanagementrdquo Structure and Infrastructure Engineering vol 3no 1 pp 53ndash60 2007

[34] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000

[35] D E Goldberg and K Deb ldquoA comparative analysis of selectionschemes used in genetic algorithmsrdquo in Foundations of GeneticAlgorithms pp 69ndash93 Morgan Kaufmann San Mateo CalifUSA 1991

[36] D E GoldbergGenetic Algorithms in Search Optimization andMachine Learning Addison-Wesley Reading Mass USA 1989

[37] H Dong and Y Liang ldquoGenetic algorithms for large join queryoptimizationrdquo in Proceedings of the 9th Annual Genetic andEvolutionary Computation Conference (GECCO rsquo07) pp 1211ndash1218 London UK July 2007

[38] I F Sbalzariniy M Sibylle and P Koumoutsakosyz ldquoMulti-objective optimization using evolutionary algorithmsrdquo Centerfor Turbulence Research Proceedings of the Summer Program2000

[39] E Zitzler and L Thiele ldquoMultiobjective optimization usingevolutionary algorithmsmdasha comparative case studyrdquo in ParallelProblem Solving from Nature pp 292ndash301 Springer Amster-dam The Netherlands 1998

[40] C A C Coello ldquoAn updated survey of GA-basedmultiobjectiveoptimization techniquesrdquo ACM Computing Surveys vol 32 no2 pp 137ndash143 2000

[41] D A Van Veldhuizen and G B Lamont ldquoMultiobjective evolu-tionary algorithms analyzing the state-of-the-artrdquo Evolutionarycomputation vol 8 no 2 pp 125ndash147 2000

[42] F Glover and S Hanafi ldquoTabu search and finite convergencerdquoDiscrete Applied Mathematics vol 119 no 1-2 pp 3ndash36 2002

[43] A C Nearchou ldquoThe effect of various operators on the geneticsearch for large scheduling problemsrdquo International Journal ofProduction Economics vol 88 no 2 pp 191ndash203 2004

[44] W W Chu and P Hurley ldquoOptimal query processing fordistributed database systemsrdquo IEEE Transactions on Computersvol 31 no 9 pp 835ndash850 1982

[45] M Serpell and J E Smith ldquoSelf-adaptation ofmutation operatorand probability for permutation representations in genetic

algorithmsrdquo Evolutionary Computation vol 18 no 3 pp 491ndash514 2010

[46] A Seshadri A Fast Elitist Multiobjective Genetic AlgorithmNSGA-II MATLAB Central 2006

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 18: Research Article Distributed Query Plan Generation Using ...downloads.hindawi.com › journals › tswj › 2014 › 628471.pdf · A distributed query processing strategy, which is

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014