[ACM Press, The Third International Workshop on Cloud Data Management (CloudDB 2011), Glasgow, Scotland, UK, October 28, 2011]



Efficient Data Distribution Strategy for Join Query Processing in the Cloud

Haiping Wang
School of Information
Renmin University of China
Beijing, China
[email protected]

Xiaofeng Meng
School of Information
Renmin University of China
Beijing, China
[email protected]

Yunpeng Chai
School of Information
Renmin University of China
Beijing, China
[email protected]

ABSTRACT
There are many advantages to large-scale data management in the cloud, and more and more companies are starting to migrate their data into cloud data management systems. Join query processing has become a challenging research problem in the cloud: to finish a join query, data needs to be transferred among different nodes. The arrangement of data transmission and local data processing is known as a distribution strategy for a query. The transmission cost (network workload between servers and the transmission time delay) will be very high if the strategy is not properly chosen. Existing cloud systems either do not support join queries or just use MapReduce to support some simple join queries. This paper studies the problem of using redundant data for join query optimization in a cloud environment. Two novel algorithms, a Set Cover based algorithm (SC) and a Minimum Element based algorithm (ME), are proposed to reduce the data transmission cost. The experimental results demonstrate that the proposed methods greatly reduce the data transmission cost compared with the naive method; besides, the result is very close to the optimal strategy.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Distributed databases; Query processing

General Terms
Algorithms

Keywords
cloud computing, distribution strategy, replicate, join

1. INTRODUCTION
As a new trend in data management, cloud computing has many advantages, so more and more companies are starting to migrate their data into cloud systems, and they sometimes run join queries over that data. To finish a join query in a distributed environment, data needs to be transferred among different nodes, and the transmission cost (the size of data transferred between servers and the transmission time delay) will not be affordable if the join query plan is not properly arranged. What's more, most cloud systems use the Bigtable[1] data model to store data: tables are partitioned into millions of data units and allocated randomly across thousands of nodes, and the location of a data unit also changes dynamically. These facts make the join query problem in a cloud environment more complex than that in a distributed database. The arrangement of data transmission and local data processing is known as a distribution strategy for a query.

----
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CloudDB'11, October 28, 2011, Glasgow, Scotland, UK.
Copyright 2011 ACM 978-1-4503-0956-1/11/10 ...$10.00.
----

Although generating an efficient distribution strategy for a join query in the cloud is very challenging, some characteristics of cloud storage systems can be exploited. One of them is that data in cloud systems is replicated, so it is feasible to take the redundant data into consideration and pick out the sub-join queries that need no data transfer. Besides, although different copies of the same data may be inconsistent, some applications such as Social Network Site (SNS) applications require only weak data consistency; thus the consistency problem is out of the scope of this paper.

This paper focuses on the problem of generating an efficient distribution strategy for a two-way join query in the cloud. Replication information is considered during the process of generating the final strategy, and two novel algorithms are proposed: a Set Cover[2] based algorithm (SC) and a Minimum Element based algorithm (ME, inspired by the minimum element method in the transportation problem). In the SC algorithm, the candidate data units of the smaller table are accessed one by one, but all the data units of the bigger table are considered together; for each data unit of the smaller table, with the help of virtual nodes, we transform the problem into a weighted Set Cover problem and use a greedy algorithm to generate the final strategy. In the ME algorithm, the node pairs with a lower cost to transfer a data unit between them do sub-join queries first. The contributions can be summarized as follows:

∙ Formulated the problem of using redundant data for join queries in the cloud and proposed a suitable transmission cost model. The main difference between this work and some previous related work is that the data was already replicated before the join query arrives; the replication operation is not driven by the query, it is just one of the characteristics of cloud storage systems. Inspired by the query algorithm for the select function in existing cloud systems, a naive method was summarized for join queries in the cloud.

∙ Proposed two distribution strategies suitable for join queries in cloud storage systems: the SC and ME algorithms. The SC algorithm is suitable for the case when one of the two join tables is much bigger than the other; the ME algorithm works better when the sizes of the two join tables are almost the same. The experimental results show that if the proper strategy as proposed in this paper is chosen, the data transmission cost can be reduced greatly.

The rest of this paper is organized as follows: related work is reviewed in Section 2; preliminaries and the cost model are explained in Section 3; we describe the optimal strategy in Section 4; the SC and ME algorithms are described in Sections 5 and 6; the experimental evaluation is in Section 7; the conclusion is given in Section 8.

2. RELATED WORK
There is little related work on using redundant data to generate a distribution strategy for join query optimization in the cloud; most existing cloud systems either do not support join queries or just use MapReduce[3] to achieve some simple join queries. In the traditional distributed database environment there is some previous work, which can be divided into two categories: Fragment and Replicate algorithms (FR) and Symmetric Fragment and Replicate algorithms (SFR).

HBase[4] and Cassandra[5] are two typical open source cloud data management systems; the architecture of HBase is Master/Slave, while Cassandra's is P2P. Both have their own advantages, but they have in common that they support only some simple operations: put(), get(), delete(), etc. The query algorithm in HBase and Cassandra can be expressed as follows: if the query is based on the rowkey, the system finds all the candidate data units and handles them one by one (for each data unit, fetching the nearest available replica); in HBase every region is a data unit, while in Cassandra the granularity of a data unit is a record. If the query is not on the rowkey, the whole table is scanned.

Both HBase and Cassandra use the Bigtable data model. Bigtable is a data model proposed by Google; it maps two arbitrary string values (row key and column key) and a timestamp (hence a three-dimensional mapping) into an associated arbitrary byte array. It is not a relational database and can be defined as a sparse, distributed, multi-dimensional sorted map. Bigtable is designed to scale up to the petabyte range across hundreds or thousands of machines, and to make it easy to add more machines to the system and automatically take advantage of those resources without any reconfiguration. Tables are optimized for GFS by splitting into multiple tablets.

Hive[6], which is built on top of Hadoop[7], is one of only a few cloud systems that support the join operation. All the data files are stored in HDFS[7]. The replication information is not visible to Hive, so Hive cannot take the replication info into consideration when doing a join query. Hive uses a MapReduce[3] job to sort the two join tables by the join key, and then does a repartition join[8] to finish the join query.

In distributed systems, previous papers[9], [10], [11] use the cost model C(X) = C0 + C1·X to compute the cost of sending X amount of data from one site to another, where C0 and C1 are constants, and the cost to process a query is the sum of the costs of sending data among sites. C(X) is a monotonically increasing function: if X ≤ Y, then C(X) ≤ C(Y).

There exist some algorithms that use redundant data for join query optimization in traditional distributed and parallel databases. Previous papers[12], [13], [14] use the FR algorithm for join query optimization. They treat the two join tables as follows: the bigger one is partitioned by its primary key and each fragment is allocated to a different site, while all tuples of the smaller one are replicated and broadcast across the join sites. In [15], the author first arranges the join nodes into a rectangle; then each fragment of R is replicated across one row of join sites, while each fragment of S is replicated down one column of join sites in the rectangle, and a two-way method finishes the join query. This is known as the SFR algorithm.

Both FR and SFR algorithms are driven by the query, so they are not suitable for join queries in the cloud, since tables have already been partitioned and replicated before the join query arrives. This is the first work to use existing redundant data to generate a distribution strategy for join query optimization in the cloud.
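The linear cost model C(X) = C0 + C1·X cited above can be sketched in a few lines; the constant values chosen here are purely illustrative, and the per-query cost is the sum over all inter-site transfers, as in the cited work.

```python
# Sketch of the classical linear transmission cost model C(X) = C0 + C1*X:
# C0 models fixed per-message overhead, C1 the per-unit transfer cost.
# The default constants are illustrative, not taken from the cited papers.

def transfer_cost(x, c0=1.0, c1=0.1):
    """Cost of sending x units of data from one site to another."""
    return c0 + c1 * x

def query_cost(transfers, c0=1.0, c1=0.1):
    """Total query cost: the sum of the costs of all inter-site transfers."""
    return sum(transfer_cost(x, c0, c1) for x in transfers)
```

Because C1 is nonnegative, transfer_cost is monotonically increasing, matching the property X ≤ Y ⇒ C(X) ≤ C(Y) stated above.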

3. PRELIMINARIES AND COST MODEL

3.1 Example
To help understand the strategies, we take a Social Network Site (SNS) application as an example. Suppose there are two tables stored in cloud storage systems:

UserInfo(userID, userName, otherUserInfo)
UserStatus(updateTime, userID, Newstatus)

Table UserInfo uses the column userID as its rowkey, while the rowkey of table UserStatus is composed of updateTime and userID. The two tables are stored in two bigtables in the cloud system and partitioned by their rowkeys into blocks; each block is replicated and randomly allocated to different nodes in the cluster. The architecture of the cloud system is master/slave: the master manages all the meta info of each data block, including its replication information, and the slaves store the data. Here comes a query as follows:

Query: Given a subset of userIDs, find out the userNames and their latest status that had been changed in a given time period (2011-03 to 2011-05).

When a join query task arrives, the master first lets the slave nodes do sub-queries on the two join tables independently, each with its own query constraints, to generate the candidate data blocks for the join. It then generates the final join query plan with one of the two distribution strategies, chosen by considering the sizes of the two join tables' candidate data. Finally, the master assigns each sub-join query task to the proper slave node and lets the slave nodes finish the sub-join queries in parallel.
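The master's strategy choice described above (SC when one table is much bigger than the other, ME when the sizes are close, per the contributions listed in Section 1) can be sketched as follows. The function name and the size-ratio threshold are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch of the master's strategy selection; the paper only
# states the qualitative rule (SC for very unequal table sizes, ME for
# similar sizes), so the ratio threshold of 10 is an assumed parameter.

def choose_strategy(candidate_size_r, candidate_size_s, ratio=10):
    """Return "SC" when one table's candidate data is much bigger than
    the other's, otherwise "ME"."""
    big = max(candidate_size_r, candidate_size_s)
    small = min(candidate_size_r, candidate_size_s)
    return "SC" if big >= ratio * small else "ME"
```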

3.2 Problem formulation
To facilitate the description, throughout this paper R (S) is used to stand for table UserInfo (UserStatus). What's more, we use the following notations:

∙ BR: the set of all candidate blocks of table R

∙ br: size of set BR

∙ BS: the set of all candidate blocks of table S

∙ bs: size of set BS

∙ NR: the set of nodes which have one or more candidate blocks of R

∙ nr: size of set NR

∙ NS: the set of nodes which have one or more candidate blocks of S

∙ ns: size of set NS

∙ N: total number of join nodes

∙ Ri: the i-th element in set BR

∙ Sj: the j-th element in set BS

∙ avRi: number of available replicas of Ri

∙ avSj: number of available replicas of Sj

∙ NRi,k: the node that stores the k-th available replica of Ri

∙ NSj,k: the node that stores the k-th available replica of Sj

∙ Ni,j,k: virtual node, meaning that Ri should be transported from the node which contains the k-th replica of Ri to node Nj

∙ Cost(q): transmission cost of join query q

∙ Cost(q, Ri): transmission cost to finish all the sub-join queries related to Ri

∙ w(Ni, Nj): cost weight to transport one unit of data from Ni to Nj

∙ X(Ni, Nj): size of data shifted between Ni and Nj

∙ SJPs(opt): number of sub-join query tasks assigned to each node in the optimal strategy

∙ c(Ni): size of data fetched from other nodes by Ni

∙ LSJPs(Ni): number of sub-join query tasks that Ni can finish without fetching data from other nodes

∙ BR(Ni): the candidate blocks of R stored in node Ni

∙ BS(Ni): the candidate blocks of S stored in node Ni

∙ bl(Ni, Nj): 0 or 1; if we need to transfer data from Ni to Nj, then bl(Ni, Nj) = 1, else bl(Ni, Nj) = 0

We first do a select operation on table UserInfo with the given userIDs and a range query on table UserStatus with the given time constraints (2011-03 to 2011-05); the candidate blocks and their distributions are generated by these operations. The join operation on the column userID can then be done based on the candidate blocks generated by the previous query results. During the join procedure, data needs to be shifted between nodes.

In our example, there are 8 slaves and N = 8. After this previous processing, table R has 5 candidate blocks and table S has 6. The distribution of these blocks and their replicas across the slave nodes can be seen in Fig. 1: BR = {R1, R2, R3, R4, R5}, br = 5, NR = {N1, N4, N6, N7, N8}, nr = 5, BS = {S1, S2, S3, S4, S5, S6}, bs = 6, NS = {N1, N2, N3, N4, N5}, ns = 5.

Figure 1: The distribution of the candidate blocks (recovered from the figure labels: N1: R1, R2, R3, S1, S2, S3; N2: S1, S3, S5; N3: S1, S2, S3, S4; N4: R1, S6; N5: S4, S6; N6: R2, R4, R5; N7: R3, R4; N8: R3, R5)
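The Fig. 1 placement can be transcribed as a small Python structure to cross-check the set sizes quoted above; the per-node block lists are read off the figure's labels as best recoverable and should be treated as an assumption (they are consistent with the per-node counts in Table 2 below).

```python
# Block placement transcribed from Fig. 1 (assumed reading of the figure).
placement = {
    "N1": {"R1", "R2", "R3", "S1", "S2", "S3"},
    "N2": {"S1", "S3", "S5"},
    "N3": {"S1", "S2", "S3", "S4"},
    "N4": {"R1", "S6"},
    "N5": {"S4", "S6"},
    "N6": {"R2", "R4", "R5"},
    "N7": {"R3", "R4"},
    "N8": {"R3", "R5"},
}

# Derive BR, BS, NR, NS from the placement, per the notation above.
BR = {b for blks in placement.values() for b in blks if b.startswith("R")}
BS = {b for blks in placement.values() for b in blks if b.startswith("S")}
NR = {n for n, blks in placement.items() if any(b.startswith("R") for b in blks)}
NS = {n for n, blks in placement.items() if any(b.startswith("S") for b in blks)}
```

Computing the sizes reproduces the values in the text: br = 5, bs = 6, nr = 5, ns = 5.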

3.3 Cost model
In this paper, we use a different cost model to better reflect the characteristics of join queries in a cloud environment. The total transmission cost to finish a given join query can be expressed by Formula (1):

Cost(q) = \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} w(N_i, N_j) \cdot X(N_i, N_j)    (1)

𝐶𝑜𝑠𝑡(𝑞) is influenced by the following parameters:

∙ nr, ns. That is to say, the cost is influenced by the distribution of the candidate data units.

∙ w(Ni, Nj), the cost to transport one unit of data from node Ni to Nj. w(Ni, Nj) mainly depends on the locations of Ni and Nj: if Ni and Nj are the same physical node, then w(Ni, Nj) = 0; if Ni and Nj are different nodes in the same datacenter and the same rack, then w(Ni, Nj) = 1; if Ni and Nj are different nodes in the same datacenter but in different racks, then w(Ni, Nj) = 2; if Ni and Nj are in different datacenters, then w(Ni, Nj) = 3. In our example, the cost matrix between nodes can be seen in Table 1:

Table 1: Cost Matrix
NR \ NS   N1   N2   N3   N4   N5
N1         0    1    1    1    3
N4         1    1    1    0    1
N6         1    2    3    1    1
N7         1    3    2    1    2
N8         2    3    1    1    2

∙ X(Ni, Nj), the size of data that needs to be transported between Ni and Nj.
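The locality-based weight defined above can be sketched as a small function; modeling a node as a (datacenter, rack, host) triple is an assumed representation, not something the paper specifies.

```python
# Sketch of the locality-based weight w(Ni, Nj) defined above.
# A node is modeled as a (datacenter, rack, host) triple (assumption).

def w(ni, nj):
    dc_i, rack_i, _host_i = ni
    dc_j, rack_j, _host_j = nj
    if ni == nj:
        return 0   # same physical node
    if dc_i != dc_j:
        return 3   # different datacenters
    if rack_i != rack_j:
        return 2   # same datacenter, different racks
    return 1       # same datacenter, same rack, different hosts
```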

Based on the cost model above, the join query can be optimized in the following four steps:

∙ Before generating the final strategy, if the two tables have some separate query constraints, apply them first to reduce the number of candidate blocks, in other words, to reduce the sizes of BR and BS.

∙ During the process of generating the final strategy, first select all the candidate sub-join block pairs that can be done without data transmission. Then, with the two strategies proposed in this paper, minimize the number of blocks transported between nodes. What's more, use the nearest replicas to choose the smaller w(Ni, Nj).


∙ Reduce the sub-join tasks by using sort-merge join.

∙ Reduce the size of transported data for each block by doing select and project operations locally first.

4. THE OPTIMAL STRATEGY
The problem of generating the optimal strategy by using redundant data in distributed database systems is NP-hard[11]. A cloud data management system is a special distributed system, and the partition and replication strategy in cloud data management systems is more complicated than that in traditional distributed systems. To simplify the presentation, some additional assumptions, as in a previous survey paper[17], are made in this paper:

∙ All the nodes run sub-join tasks in parallel;

∙ Each node does sub-join query and data transmissionin parallel;

∙ Data is prepared before a sub-join query arrives. This assumption is based on the fact that both tables R and S have several blocks in each candidate sub-join node, so when a data block is shifted to a node, the node can do several sub-join pairs with that data block.

The time cost of a given join query in the cloud equals the sub-join query time of the node that finishes its sub-join query tasks last. So, to finish the join query more efficiently, it works better to distribute all the sub-join query tasks across the nodes uniformly. SJPs(opt) is defined in Equation (2):

SJPs(opt) = \lfloor (|BR| \cdot |BS|) / N \rfloor    (2)

where SJPs(opt) is the number of sub-join queries that each join node should be assigned, and |BR| · |BS| is the total number of sub-join query tasks. Based on the above assumptions, the optimal strategy can be expressed as follows: all the sub-join query tasks should be divided into N parts, and the number of sub-join query tasks each node is assigned equals SJPs(opt) or SJPs(opt) + 1. The transmission cost of the optimal strategy is Cost(opt).

It is difficult to get the exact value of Cost(opt), so in order to evaluate the performance of the two proposed algorithms, we give another definition, c, and use c for comparison instead of Cost(opt), where

c = \min(w(N_i, N_j)) \cdot \sum_{i=1}^{N} c(N_i)    (3)

and

c(N_i) = \lfloor (SJPs(opt) - LSJPs(N_i)) / \max(|BR(N_i)|, |BS(N_i)|) \rfloor  if SJPs(opt) - LSJPs(N_i) > 0, and c(N_i) = 0 otherwise    (4)

For each node, to finish SJPs(opt) sub-join query tasks with the least transmission cost, we need to fetch c(Ni) data from one or several of its "nearest" nodes. As expressed in Equation (4): if the number of local sub-join pairs is equal to or bigger than SJPs(opt), there is no need to fetch data and c(Ni) = 0; otherwise, c(Ni) > 0.

In our example SJPs(opt) = 3, and the number of blocks that each node needs to fetch from other nodes can be expressed as Table 2:

Table 2: Number of blocks fetched from other nodes
NodeId   Rblock   Sblock   LSJPs(i)   c(Ni)
N1       3        3        9          0
N2       0        3        0          1
N3       0        4        0          0
N4       1        1        1          2
N5       0        2        0          1
N6       3        0        0          1
N7       2        0        0          1
N8       2        0        0          1
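Equations (2) and (4) can be checked against Table 2 with a few lines of Python. This sketch assumes LSJPs(Ni) is the product of the local R and S candidate counts, which matches the LSJPs column of Table 2 (e.g. N1: 3 · 3 = 9).

```python
# Recompute Table 2's c(Ni) column from Equations (2) and (4), using the
# per-node candidate-block counts of the running example.
from math import floor

N = 8
br, bs = 5, 6
counts = {  # node -> (candidate R blocks, candidate S blocks), per Table 2
    "N1": (3, 3), "N2": (0, 3), "N3": (0, 4), "N4": (1, 1),
    "N5": (0, 2), "N6": (3, 0), "N7": (2, 0), "N8": (2, 0),
}

sjps_opt = floor(br * bs / N)  # Equation (2): floor(30 / 8) = 3

def c(node):
    r, s = counts[node]
    lsjps = r * s  # local sub-join pairs (assumption: product of counts)
    if sjps_opt - lsjps <= 0:
        return 0
    return floor((sjps_opt - lsjps) / max(r, s))  # Equation (4)

c_values = {n: c(n) for n in counts}
```

Running this reproduces the c(Ni) column of Table 2 exactly.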

It is obvious that

𝐶𝑜𝑠𝑡(𝑜𝑝𝑡) ≥ 𝑐 (5)

5. SC ALGORITHM
This section introduces the SC algorithm (the Set Cover[2] based method), the first distribution strategy proposed for join query optimization in the cloud. In the following parts of this section, we first introduce the Set Cover problem briefly, then transform our problem into a weighted Set Cover problem with the help of virtual nodes, and finally use a greedy algorithm to generate the final strategy.

5.1 Set Cover problem
The Set Cover problem is a classical question in computer science and complexity theory. It can be expressed as follows: given a set of elements E = {e1, e2, ..., en} and a set of m subsets of E, S = {S1, S2, ..., Sm}, find a "least cost" collection C of sets from S such that C covers all the elements in E; that is, ∪_{Si∈C} Si = E.

The Set Cover problem comes in two variants, unweighted and weighted.

In unweighted Set Cover, the cost of the collection C is the number of sets contained in it. In weighted Set Cover, there is a nonnegative weight function ω : S → R, and the cost of C is defined to be its total weight, i.e., \sum_{Si∈C} ω(Si).

Although the Set Cover problem is NP-hard, there are greedy approximation algorithms that can be used to solve it.
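As a concrete reference point, the standard greedy for weighted Set Cover (repeatedly pick the subset with the smallest weight per newly covered element) can be sketched as follows; this is the generic greedy that the SC algorithm later specializes with a modified minimum function.

```python
# Minimal greedy approximation for weighted Set Cover: at each step pick
# the subset minimizing weight / (number of still-uncovered elements).

def greedy_set_cover(universe, subsets, weights):
    """subsets: name -> set of elements; weights: name -> nonnegative cost.
    Assumes the union of all subsets covers the universe."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (name for name in subsets if subsets[name] & uncovered),
            key=lambda name: weights[name] / len(subsets[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

On the naive-SC instance worked out below (E = {S4, S5, S6}; candidate holders N2, N3, N4, N5 with weights 1, 1, 1, 3 read off Fig. 1 and Table 1), this greedy selects N2, N3, N4 at total weight 3, matching the text.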

5.2 Problem transformation
In the SC algorithm, all the candidate blocks of table S are kept in the nodes where they reside, and table R's data is transported to finish the join query; R's candidate blocks are accessed one by one, while all the candidate blocks of table S are considered together. Suppose the cost to finish all the sub-join query tasks related to Ri is Cost(q, Ri); then the cost model can be expressed in another form:

Cost(q) = \sum_{i=1}^{b_r} Cost(q, R_i)    (6)

where

Cost(q, R_i) = \sum_{k=1}^{avR_i} \sum_{j=1}^{n_s} w(NR_{i,k}, N_j) \cdot |R_i| \cdot bl(NR_{i,k}, N_j)    (7)

Let us take R2 as an example. To finish the whole join query, R2 needs to do sub-join queries with all the elements in BS. As shown in Fig. 1, avR2 = 2, NR2,1 is N1, and NR2,2 is N6, so

Cost(q, R2) = |R2| · [ w(N1, N1)·bl(N1, N1) + w(N1, N2)·bl(N1, N2) + w(N1, N3)·bl(N1, N3) + w(N1, N4)·bl(N1, N4) + w(N1, N5)·bl(N1, N5) + w(N6, N1)·bl(N6, N1) + w(N6, N2)·bl(N6, N2) + w(N6, N3)·bl(N6, N3) + w(N6, N4)·bl(N6, N4) + w(N6, N5)·bl(N6, N5) ]

One way to reduce Cost(q, R2) is to reduce the destination

nodes that R2 should be transported to; that is, to find fewer nodes that together contain all of table S's blocks. What's more, it is obvious that R2 can do sub-join queries in node N1 with S1, S2, S3 without any data transmission.

It is feasible to transform the problem into a weighted Set Cover problem. The set {S4, S5, S6} can be seen as the set E in the weighted Set Cover problem; it has four subsets, N2, N3, N4, N5, and the cost weights w(N1, N2), w(N1, N3), w(N1, N4), w(N1, N5) are the weights in the weighted Set Cover problem. So the destination nodes can be calculated by the greedy algorithm for the weighted Set Cover problem. As shown in Fig. 2, the destination nodes that R2 should be transported to are N2, N3, N4, so Cost(q, R2) = |R2| · [w(N1, N2) + w(N1, N3) + w(N1, N4)] = |R2| · [1 + 1 + 1] = 3 · |R2|.

Figure 2: Naive SC algorithm

The naive SC algorithm does not make use of the replication info of R2, because if we take the replication info into consideration, it is no longer a weighted Set Cover problem. To take the replication info into consideration and still transform the problem into a weighted Set Cover problem, two types of virtual nodes are introduced: NVi and Ni,j,k. NVi is a virtual node standing for all the nodes that store Ri (have a replica of Ri), while Ni,j,k means that Ri should be transported from the node which contains the k-th replica of Ri to node Nj. The number of virtual nodes Ni,j,k depends on the size of NS and the number of available replicas of Ri; the cost weights stay the same. In our example, R2 has 2 available replicas, the first stored in N1 and the second in N6, and NV2 is N0; all the virtual nodes related to R2 can be seen in Fig. 3:

Figure 3: SC algorithm and virtual nodes

With these two types of virtual nodes, the problem becomes a weighted Set Cover problem and we can use the greedy algorithm to calculate the destination nodes. In our example, as shown in Fig. 3, there are two destination nodes for R2: N2,2,1 and N2,5,2. So the final transmission strategy is that N2 fetches a replica of R2 from N1 to finish sub-join query task (R2, S5), and N5 fetches a replica of R2 from N6 to finish sub-join query tasks (R2, S4) and (R2, S6). Then Cost(q, R2) = |R2| · [w(N1, N2) + w(N6, N5)] = |R2| · [1 + 1] = 2 · |R2|. The SC algorithm can be expressed in Algorithm 1 and Algorithm 2:

Algorithm 1 SC Algorithm
Input: Set BS, NS, NFBS, Ri
Output: Cost(q, Ri), TStrategy

1: procedure
2:   List R' = Ri.getavReplicas();
3:   for (k = 0; k < R'.size(); k++) do
4:     Node NRi,k = R'.getNode(k);
5:     if (NS.contains(NRi,k)) then
6:       Set BStemp = NFBS.get(NRi,k);
7:       BS = BS − BStemp;
8:       NS = NS − NRi,k;
9:       TStrategy.add(NRi,k, NRi,k, BStemp);
10:    end if
11:  end for
12:  NRi,k = null;
13:  BStemp = null;
14:  while (!BS.isEmpty()) do
15:    Find NRi,k, SNode using Algorithm 2;
16:    BStemp = NFBS.get(SNode);
17:    BS = BS − BStemp;
18:    NS = NS − SNode;
19:    for (l = 0; l < NFBS.size(); l++) do
20:      NFBS.get(l) = NFBS.get(l) − BStemp;
21:    end for
22:    Cost(q, Ri) = Cost(q, Ri) + w(NRi,k, SNode);
23:    TStrategy.add(NRi,k, SNode, BStemp);
24:  end while
25: end procedure

In Algorithm 1, NFBS is a map data structure that maps a node (an element of set NS) to a list of blocks (the blocks which have at least one available replica in this node); the function NFBS.get(Ni) returns all the candidate blocks of table S that Ni stores. TStrategy manages the final data transmission strategy; the function TStrategy.add(NRi,k, SNode, BStemp) means that we should transport the k-th replica of Ri from node NRi,k to node SNode and, for each element Sj of set BStemp, do sub-join query (Ri, Sj) in node SNode. The body of Algorithm 1 from line 3 to line 10 checks all the sub-join queries that can be finished without any data transmission; the rest of Algorithm 1 is the core of the SC algorithm (the minimum function of the weighted Set Cover greedy algorithm is changed). The new minimum function for the weighted Set Cover greedy algorithm is expressed in Algorithm 2.

It is notable that every time we get a candidate node in NS in the while loop of Algorithm 1, the values of sets BS, NS, NFBS are updated (from line 16 to line 20 in Algorithm 1), so in Algorithm 2 at line 10 the denominator is the number of blocks that still need to do a sub-join with Ri. The virtual nodes are reflected in Algorithm 2 from line 7 to line 16.

6. ME ALGORITHM
The unit considered in the SC algorithm is block pairs rather than node pairs; the ME algorithm considers node pairs first. The key of a distribution strategy problem is to find out how to transfer data between nodes, which is similar to the transportation problem. The ME algorithm is inspired by the minimum element method for solving the transportation problem and has four main steps:


Algorithm 2 New minimum function for the Weighted Set Cover Greedy algorithm
Input: Set NS, NFBS, R'
Output: NRi,k, SNode

1: procedure
2:   Node NRi,k, SNode = null;
3:   float cost = Float.MAX_VALUE;
4:   for (i = 0; i < NS.size(); i++) do
5:     Node tmpSNode = NS.get(i);
6:     Set tmpSBlock = NFBS.get(tmpSNode);
7:     for (k = 0; k < R'.size(); k++) do
8:       Node RNode = R'.getNode(k);
9:       if tmpSBlock.size() > 0 then
10:        float tmpcost = w(RNode, tmpSNode) / tmpSBlock.size();
11:        if tmpcost < cost then
12:          cost = tmpcost; SNode = tmpSNode;
13:          NRi,k = RNode;
14:        end if
15:      end if
16:    end for
17:  end for
18: end procedure
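Algorithm 2's selection rule can be sketched as runnable Python (illustrative, not the authors' code): among all (replica node of Ri, candidate S node) pairs, pick the one minimizing the weight divided by the number of S blocks still unserved at that node.

```python
# Sketch of Algorithm 2: the modified minimum function of the weighted
# Set Cover greedy. Function and parameter names are illustrative.

def pick_pair(ns_nodes, nfbs, r_nodes, w):
    """ns_nodes: remaining candidate S nodes; nfbs: node -> set of remaining
    S blocks at that node; r_nodes: nodes holding an available replica of Ri;
    w: (node_a, node_b) -> transmission cost weight."""
    best, best_cost = None, float("inf")
    for s_node in ns_nodes:
        blocks = nfbs[s_node]
        if not blocks:  # line 9: skip nodes with no remaining S blocks
            continue
        for r_node in r_nodes:
            # line 10: weight divided by blocks still to be joined there
            cost = w(r_node, s_node) / len(blocks)
            if cost < best_cost:
                best_cost, best = cost, (r_node, s_node)
    return best  # (NRi,k, SNode)
```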

Step 1: Use set BRSSet to generate and store all the sub-join block pairs; each block pair maps to a sub-join query task.

Step 2: Get the current cost matrix w(Ni, Nj) and sort the node pairs by the cost weight w. Store all the node pairs in a queue Q, with the node pair with the smallest weight w as the head element; in the following steps, we get the minimum element in each loop by popping an element from queue Q. In our example, as shown in Table 1, the first element in Q is either node pair (N1, N1) or (N4, N4).

Step 3: For each node pair, generate all the sub-join block pairs (queries) they can finish. U(Ni, Nj) is used to manage the sub-join block pairs (queries) for node pair (Ni, Nj), where Ni (Nj) is an element of set NR (NS). For example, U(N7, N5) = {(R3, S4), (R3, S6), (R4, S4), (R4, S6)}.

Step 4: The fourth step is a loop: pop the first element from Q, then update set BRSSet (if the popped element is (Ni, Nj), update BRSSet to BRSSet − U(Ni, Nj)); continue this loop until BRSSet or Q is empty.

These four steps generate a distribution strategy within a short time; however, they cannot tell us how to transfer data between each node pair, so another set U′(Ni, Nj) is introduced for improvement, where

U′(Ni, Nj) = BRSSet ∩ U(Ni, Nj)    (8)

Count the blocks of tables R and S related to U′(Ni, Nj) and store them in sets BR′ and BS′. The number of blocks that need to be transported between the two nodes is TB(Ni, Nj), where

TB(Ni, Nj) = min{|BR′|, |BS′|}    (9)

For each node pair, transfer data from the node with fewer related blocks to the other. Update set BRSSet using U′(Ni, Nj) instead of U(Ni, Nj):

BRSSet = BRSSet − U′(Ni, Nj)    (10)
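One iteration of the improved step 4 (Eqs. 8-10) can be sketched as follows. The encoding of a block pair (Ra, Sb) as Map.entry(a, b) is an assumption of this sketch, not the paper's data structure:

```java
import java.util.*;

public class MEStep4 {
    /**
     * One iteration of ME step 4 for a node pair (Ni, Nj):
     *   U' = BRSSet ∩ U(Ni, Nj);  TB = min(|BR'|, |BS'|);  BRSSet -= U'.
     * Mutates brsSet in place and returns TB, the number of blocks to transfer.
     */
    static int assign(Set<Map.Entry<Integer, Integer>> brsSet,
                      Set<Map.Entry<Integer, Integer>> u) {
        Set<Map.Entry<Integer, Integer>> uPrime = new HashSet<>(u);
        uPrime.retainAll(brsSet);                 // Eq. (8): U' = BRSSet ∩ U
        Set<Integer> br = new HashSet<>(), bs = new HashSet<>();
        for (Map.Entry<Integer, Integer> p : uPrime) {
            br.add(p.getKey());                   // distinct R blocks in U'
            bs.add(p.getValue());                 // distinct S blocks in U'
        }
        brsSet.removeAll(uPrime);                 // Eq. (10): BRSSet -= U'
        return Math.min(br.size(), bs.size());    // Eq. (9): TB
    }
}
```

On the paper's (N1, N2) example this yields TB = 1: with BR′ = {R1, R2, R3} and BS′ = {S5}, only block S5 needs to move.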

In our example, as shown in Fig. 1 and Table 1, the first two elements popped from Q are (N1, N1) and (N4, N4), and the node pair popped in the third loop is (N1, N2). Take node pair (N1, N2) for example: U′(N1, N2) = {(R1, S5), (R2, S5), (R3, S5)} (although node pair (N1, N2) can handle other sub-join pairs such as (R1, S1), (R2, S1), (R3, S1), (R1, S3), (R2, S3), and (R3, S3), they had already been assigned to node pair (N1, N1) in a previous loop), so BR′ = {R1, R2, R3}, BS′ = {S5}, and TB(N1, N2) = 1; block S5 should therefore be transported from N2 to N1.

There is a possibility that most of the sub-join pairs are assigned to one or a few nodes in the cluster, so a further improvement is made to avoid this.

We define a threshold th, another queue Q′, and an array assignedNumber. th is used to balance the number of sub-join tasks among nodes and equals SJPs(opt). Array assignedNumber counts the number of sub-join tasks previously assigned to each node; its length equals the number of join nodes in the cluster, and its entries are initialized to 0. Step 4 is then extended as follows.

In each loop, pop a node pair e from queue Q; if the pair is (Ni, Nj), first check the values of assignedNumber[Ni] and assignedNumber[Nj]. There are three possible outcomes, each handled differently:

∙ If both assignedNumber[Ni] < th and assignedNumber[Nj] < th, proceed as before.

∙ If both assignedNumber[Ni] ≥ th and assignedNumber[Nj] ≥ th, do nothing but push e into Q′.

∙ Otherwise, assign the sub-join tasks to the node whose assigned number is less than th and transport the related data blocks from the other node.

Continue the loop until 𝐵𝑅𝑆𝑆𝑒𝑡 or 𝑄 is empty.

If BRSSet or Q′ is not empty, start another loop similar to the loop in step 4. The difference is that after popping an element from queue Q′, we simply assign the related sub-join pairs to the node whose assigned number is smaller and fetch the data blocks from the other node.
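The three-way check above can be sketched as follows; the integer node ids, the assignedNumber array, and the enum of actions are assumptions of this sketch rather than the paper's code:

```java
public class BalancedAssign {
    enum Action { ASSIGN_AS_BEFORE, DEFER_TO_Q_PRIME, ASSIGN_TO_LIGHTER_NODE }

    /**
     * The three-way decision of the extended step 4. assignedNumber[n] counts
     * the sub-join tasks already given to node n; th is the balance threshold
     * (equal to SJPs(opt) in the paper).
     */
    static Action decide(int ni, int nj, int[] assignedNumber, int th) {
        boolean niFull = assignedNumber[ni] >= th;
        boolean njFull = assignedNumber[nj] >= th;
        if (!niFull && !njFull) return Action.ASSIGN_AS_BEFORE;   // both have capacity
        if (niFull && njFull)   return Action.DEFER_TO_Q_PRIME;   // push pair into Q'
        return Action.ASSIGN_TO_LIGHTER_NODE;  // run tasks on the under-threshold node
    }
}
```

Deferring saturated pairs into Q′ rather than dropping them is what lets the later cleanup loop still finish every sub-join while keeping the per-node load near th.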

7. EXPERIMENT AND ANALYSIS

This section compares experiment results of the SC and ME algorithms with the optimal strategy and the naive method. The naive method is inspired by existing query algorithms in cloud systems.

7.1 The naive method

The naive method for join queries in cloud systems is very simple; it resembles the replica selection performed when a get operation is executed in the cloud. For each block pair (Ri, Sj) ready for a sub-join query, first pick out the available replicas of the two blocks. Since all replicas of the same block are stored on different nodes, there are avRi available nodes each holding a replica of block Ri, and likewise avSj nodes for block Sj, so avRi * avSj node pairs can finish the sub-join query task for block pair (Ri, Sj). Choose the node pair with the least cost weight; if the cost equals 0 (blocks Ri and Sj have at least one available replica stored on the same physical node), the sub-join query can be finished without any data transmission; otherwise, transport block Ri or Sj to the corresponding node. Repeat this until all the sub-join query tasks are assigned.
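A minimal sketch of the naive selection for a single block pair, assuming integer node ids and a cost-weight matrix w standing in for the node-pair cost:

```java
import java.util.*;

public class NaiveAssign {
    /**
     * Naive replica selection for one block pair (Ri, Sj): enumerate all
     * avRi * avSj (R-replica node, S-replica node) combinations and keep the
     * cheapest; co-located replicas cost 0 and need no data transfer.
     */
    static int[] cheapest(List<Integer> rNodes, List<Integer> sNodes, double[][] w) {
        double best = Double.MAX_VALUE;
        int[] pick = null;
        for (int rn : rNodes) {
            for (int sn : sNodes) {
                double c = (rn == sn) ? 0.0 : w[rn][sn];  // same node: free sub-join
                if (c < best) {
                    best = c;
                    pick = new int[]{rn, sn};
                }
            }
        }
        return pick;
    }
}
```

Because each block pair is decided in isolation, the naive method cannot share transfers across sub-joins, which is exactly the inefficiency SC and ME exploit.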


replicationFactor = 3, NodeNumber = 256, Size(S) = 64GB

Decision time (ms):

            R:S=1/1000  R:S=10/1000  R:S=100/1000  R:S=1000/1000
  Optimal        7           8            10             10
  Naïve         28         148           315            976
  SC           187        1092          5390          13476
  ME            52         163           681           2096

Transmission cost:

            R:S=1/1000  R:S=10/1000  R:S=100/1000  R:S=1000/1000
  Optimal        1        2301         25668         227793
  Naïve       2911       30832        305706        1033331
  SC           545        5630         54877         542837
  ME           583        6811         82121         588841

Figure 4: Performance evaluation experiment

NodeNumber = 256, Size(S) = 64GB

Transmission cost (R:S = 1):

  Replication factor:   1        2        3        4        5        6
  Optimal          691869   340503   227795   163689   127179   102009
  Naïve           4238436  3409356  3033331  2853000  2726435  2601358
  ME              1008532   760035   588841   489771   430996   384874

Transmission cost (R:S = 0.01):

  Replication factor:   1        2        3        4        5        6
  Optimal            8409     3801     2301     1584     1122      735
  Naïve             42434    35459    30832    28885    27640    26230
  SC                10850     7571     5630     4494     3694     3218

Figure 5: The effect of the replication factor

7.2 Environmental setup

Our testing infrastructure includes 20 machines connected together to simulate cloud computing platforms: 1 master and 19 slaves. Each node contains an Intel Core2 2.33GHz CPU, 8GB of main memory, and a 2TB hard disk. The OS is Ubuntu 9.10, and the communication bandwidth is 1Gbps. To simulate cloud characteristics, we use this infrastructure to simulate a cloud cluster with 100 to 600 nodes. We conducted four types of experiments:

∙ Performance evaluation experiment. Compare the strategy decision time and the transmission cost against the optimal strategy and the naive method to evaluate the performance of the two distribution strategies.

∙ Replication factor experiment. By changing the replication factor from 1 to 6, we find out how the replication factor affects the performance of the two algorithms.

∙ Scalability experiment. Evaluate the scalability of the two distribution strategies by scaling the number of nodes in the cluster from 100 to 600.

∙ Effectiveness experiment. Evaluate the efficiency of the two distribution strategies by scaling up the data sizes of the two join tables; the size of the bigger table is scaled from 64GB to 384GB.

The experiment dataset is telecom CDR (Call Detail Record) data with two tables, R(ColA, ColB) and S(ColC, ColA), where ColA and ColC are of long type and ColB is a string. ColA is the primary key of table R and ColC is the primary key of table S. All the tuples in table R (S) are sorted by the primary key ColA (ColC) and then split into several data units on the primary key using the Bigtable model; the size of each data unit is 64MB. Each data unit has several replicas randomly allocated across all the slaves in the cluster, and the master manages all the data units' metadata: location info, replication info, and key range. To make the experiment more representative, the cost to transfer one data unit between nodes is set to 3, 4, or 5. The size of table S is bigger than that of R. The join query is Select ColC, ColB From R, S Where R.ColA > S.ColA and R.ColA < L. After the query is sent to the master node, the master uses the metadata to find the candidate data units, then uses the distribution strategies proposed in this paper to generate the query plan and assigns the sub-join query tasks to the slaves to finish the query.

It is notable that generating the optimal strategy is NP-hard; we use c to represent the transmission cost of the optimal strategy in all the experiments. The experiments are programmed and tested in the Java language.

7.3 Results and analysis

As shown in Fig. 4, the transmission costs of both the SC and ME algorithms are much less than that of the naive method, and the SC algorithm works even better than the ME algorithm. Unfortunately, the decision time of the SC algorithm is much longer, especially when min(|BR|, |BS|) / max(|BR|, |BS|) equals 1. We can conclude from these experiment results that as the ratio sizeof(R) / sizeof(S) increases, and especially when it equals 1, the ME algorithm is more suitable than the SC algorithm; otherwise the SC algorithm works better. Based on this result, in the next three groups of experiments we compare our two strategies with the optimal strategy and the naive method separately: when evaluating the SC algorithm the ratio is set to 0.01, and the ratio is 1 for the ME algorithm.

One thing that must be pointed out is that the decision time reported for the optimal strategy is the time for calculating c; it is not the actual decision time of the optimal strategy, since generating the optimal strategy is NP-hard. In the remaining parts of this paper, we use c for comparison instead of the optimal strategy.

The effect of the replication factor on the performance of the SC and ME algorithms can be seen in Fig. 5. The results show that no matter what value the replication factor is set to, our two strategies perform much better than the naive method and stay close to the optimal strategy. What's more, as the replication factor increases, the transmission cost decreases, but the rate of decline also diminishes, especially once the replication factor reaches 3, which is strikingly similar to existing cloud data management systems' settings.

ReplicationFactor = 3, Size(S) = 64GB

Transmission cost (R:S = 1):

  Node number:    100      200      300      400      500      600
  Optimal       82059   153945   240105   315477   498027   469113
  Naïve       2862626  3009580  3047387  3071855  3077821  3091543
  ME           236718   443125   633272   813734  1075277  1075664

Transmission cost (R:S = 0.01):

  Node number:    100      200      300      400      500      600
  Optimal         777     1803     2910     3942     6408     7311
  Naïve         28712    29837    30383    30150    30770    30578
  SC             2609     4439     5874     7351     7523     9285

Figure 6: Scale up by node number

Figure 7: Scale up by data size

Fig. 6 and Fig. 7 illustrate the scalability and efficiency of the ME and SC algorithms. These graphs show that the ME and SC algorithms work efficiently and scale almost linearly with the number of nodes or the table size. The transmission costs of the SC and ME algorithms stay close to the optimal strategy's cost.

8. CONCLUSION AND FUTURE WORK

In this paper, we propose two novel, efficient distribution strategies suitable for join queries in the cloud: the SC and ME algorithms. We take the redundant data into consideration to reduce the transmission cost. Based on some reasonable assumptions, we give a definition of the optimal strategy and show that its transmission cost is no less than c. We also perform experimental evaluations comparing the SC and ME algorithms with the optimal strategy and the naive method; the results show that our strategies can greatly reduce the transmission cost for join queries in cloud systems.

For future work: as the sizes of the two tables increase, the decision time of the SC algorithm increases too; since we treat the blocks of table R independently, we can improve the SC algorithm with parallel computing to reduce the decision time. For the ME algorithm, although our query processing algorithm has very good scalability as the number of slave nodes increases, it costs more memory during the computation, so we will introduce buffer management algorithms to reduce the memory cost.

Our strategies suit cloud systems whose architecture is master/slave and that use the Bigtable data model to arrange data, so we want to bring our algorithms into the HBase system and use MapReduce jobs to make HBase support join queries; our two algorithms can be applied to the map-task assignment phase, the shuffle phase, and the join phase. We will also make a trade-off between transmission cost and query time for join queries in the cloud.

9. ACKNOWLEDGEMENTS

This research was partially supported by grants from the Natural Science Foundation of China (No. 60833005, 61070055, 91024032), the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 10XNI018), and the National Science and Technology Major Project (No. 2010ZX01042-002-003).

10. REFERENCES

[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Seattle, Washington, November 2006, pp. 205-218.

[2] Set cover problem: http://en.wikipedia.org/wiki/Set_cover_problem

[3] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, January 2008.

[4] HBase: http://hadoop.apache.org/hbase/

[5] Cassandra: http://cassandra.apache.org/

[6] Hive: http://hive.apache.org/

[7] Hadoop: http://hadoop.apache.org

[8] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD 2010, pp. 975-986.

[9] J. M. Chang. A heuristic approach to distributed query processing. In Proceedings of the 8th VLDB, Mexico City, Mexico, September 1982.

[10] A. Hevner and S. B. Yao. Query processing in distributed database systems. IEEE Transactions on Software Engineering, SE-5(3):177-187, May 1979.

[11] C. T. Yu, K. Lam, C. C. Chang, and S. K. Chang. A promising approach to distributed query processing. In Proceedings of the Berkeley Conference on Distributed Data Bases, February 1982, pp. 363-390.

[12] C. T. Yu, C.-C. Chang, M. Templeton, D. Brill, and E. Lund. Query processing in a fragmented relational distributed system: Mermaid. IEEE Transactions on Software Engineering, 11(8):795-810, 1985.

[13] V. Stoumpos and A. Delis. Fragment and replicate algorithms for non-equi-join evaluation on smart disks. In ISADS 2009, pp. 471-478.

[14] C.-H. Lee and M.-S. Chen. Distributed query processing in the Internet: Exploring relation replication and network characteristics. In ICDCS 2001, pp. 439-446.

[15] A symmetric fragment and replicate algorithm for distributed joins. IEEE Transactions on Parallel and Distributed Systems, 4(12), December 1993.

[16] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of SOSP '03, New York, USA, December 2003, pp. 29-43.

[17] D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422-469, 2000.
