[ieee 2008 19th international conference on database and expert systems applications (dexa) - turin,...

A Combined Selection of Fragmentation and Allocation Schemes in Parallel Data Warehouses

Soumia Benkrid
INI, Algiers - Algeria
Email: s [email protected]

Ladjel Bellatreche
LISI/ENSMA - Poitiers University, Futuroscope 86960 France
Email: [email protected]

Habiba Drias
USTHB, Algiers - Algeria
Email: h [email protected]

Abstract— The process of designing a parallel data warehouse has two main steps: (1) fragmentation and (2) allocation of the generated fragments to various nodes. Usually, fragmentation and allocation tasks are performed iteratively (the warehouse is first split horizontally and the fragments are then allocated over the nodes). The main drawback of this design approach (called iterative) is that it does not take into account the interdependencies between fragmentation and allocation, even though the generated fragments are the input of the data allocation problem. In this paper, we consider a parallel data warehouse design approach combining data fragmentation and allocation. Its main characteristic is that it decides on the quality of the allocation schema when fragmenting the warehouse. Our approach is validated using computational tests over a variety of parameter values.

I. INTRODUCTION

Data warehouse applications manage large amounts of data in order to improve the decision-making process. Querying and managing these data has become a crucial performance issue. Without optimization techniques, queries may take hours or days to run because of their high complexity. A variety of optimization techniques have been proposed for relational data warehouses in the literature and are supported by commercial database systems. We can cite materialized views, advanced indexing schemes, data partitioning, clustering and parallel processing. While the first four optimization techniques have been investigated extensively in data warehousing environments, parallel query processing tailored for data warehouses has received very little attention in the research community, except for the works in [1], [2]. Designing a parallel data warehouse passes through three steps: (i) data warehouse fragmentation, (ii) allocation of fragments and (iii) query processing. The data fragmentation process consists in partitioning the warehouse schema into a set of horizontal fragments. Data allocation is the process of assigning fragments to the nodes of a parallel machine. Finally, global queries are rewritten on the fragments in order to ensure high query performance.

By exploring the most important works on designing parallel databases (or data warehouses), we identify two main categories: (1) data partitioning oriented works and (2) data allocation oriented works. In the first category, works concentrated on developing algorithms for partitioning a database (data warehouse) schema horizontally or vertically [3], [4], [5], but they did not consider the data allocation problem. In the second category, works concentrated on how to allocate the generated fragments (or tables of non-partitioned databases) over the nodes of parallel (or distributed) machines [2], [6]. Fragmentation and allocation problems were treated in isolation (or iteratively). The main drawback of the iterative design of a parallel data warehouse is that it does not take into account the interdependencies between fragmentation and allocation. Note that the generated fragments are the input of the data allocation problem. To combine these two problems, one approach consists in judging the quality of a generated fragmentation schema of the warehouse based on its allocation. In other words, at partitioning time, a decision on the quality of the allocation schema is taken. To do so, we need to use the same cost model for the partitioning and allocation processes.

In our study, we focus on relational data warehouses based on a star schema. It consists of a huge fact table and multiple dimension tables. Queries executed on top of this schema typically perform aggregations on the fact table based on selections among the available dimension levels. Fragmenting a data warehouse consists mainly in partitioning the fact table. To do so, we have proposed a methodology to partition it using the fragmentation schemes of dimension tables [4]¹. The consequence of such fragmentation is that the star schema is broken into a set of sub star schemes, which are then allocated over the nodes efficiently. Our approach combines the fragmentation and allocation processes (see Figure 1).

[Figure 1: inputs (Database, Set of Queries); steps (Data Fragmentation, Fragment Allocation); output (Fragmentation & Allocation Schemes)]

Fig. 1. Steps of Combined Design Approach

The remainder of the paper is organized as follows. In the next section, we mention related work on designing parallel data warehouses and we show their limitations. Section 3

¹ Fragmentation schema of a table is the result of the partitioning process.

19th International Conference on Database and Expert Systems Application

1529-4188/08 $25.00 © 2008 IEEE

DOI 10.1109/DEXA.2008.63


presents the main steps of our combined approach: data fragmentation and data allocation. Section 4 presents performance results of various experiments for the proposed approach. Section 5 concludes the paper by summarizing the main results and suggesting future work.

II. RELATED WORK

The problem of designing parallel databases has been widely studied in the context of traditional databases [6]. Only a few studies have addressed the design of parallel data warehouses [2], [1]. In [2], a multi-dimensional hierarchical fragmentation and allocation method, called MDHF, is developed for relational data warehouses. To prepare the allocation process, Stohr et al. [2] partitioned the fact table based on the fragmentation schemes of dimension tables. This work considered only point fragmentation of each dimension table, where each value range consists of exactly one attribute value of a fragmentation attribute². The authors required each fragmentation attribute to belong to a hierarchy of a dimension table. The proposed fragmentation approach selected a set of fragmentation attributes from the dimension attributes, at most one attribute per dimension table [2]. This approach is very restrictive, since in the context of relational data warehouses, dimension tables may be fragmented using several attributes that may or may not belong to a hierarchy. In order to ensure high query performance, [2] selected bitmap join indexes on the fragmented data warehouse. This selection is done using non-fragmentation attributes. This work led to the development of a tool called WARLOCK [7], which automatically determines a parallel data warehouse's allocation to disk. It uses an internal cost model and heuristics to determine a disk allocation minimizing both inputs/outputs and query response times. Data allocation is done using guidelines.

Furtado [1] shows the shortcomings of basic placement, which consists in horizontally partitioning the fact table and allocating the generated fragments using a round-robin or random distribution, while the much smaller dimensions are fully replicated on all nodes. He proposed a method based on hash partitioning of the non-small dimensions using their primary keys. None of these works combined the problems of fragmenting and allocating the data warehouse.

III. A COMBINED SELECTION OF FRAGMENTATION AND ALLOCATION SCHEMES

In this section, we propose a method combining the selection of fragmentation and allocation schemes. During the fragmentation selection, our method decides whether the generated fragmentation schema is interesting for the allocation process. If not, it searches for another fragmentation schema (since there is a large number of candidate fragmentation schemes [4]). In the case where the fragmentation schema is feasible for data allocation, an algorithm is developed to assign fragments to the various nodes.

² A fragmentation attribute is an attribute participating in the fragmentation process.

Horizontal partitioning³ is the core of parallel data warehouse design. In the context of relational data warehouses, it allows tables, indexes and materialized views to be decomposed into disjoint sets of rows (called fragments) that are physically stored and usually accessed separately [8]. Most of today's commercial database systems offer native DDL (data definition language) support for defining horizontal partitions of a table [8].

There are two versions of horizontal partitioning [3]: primary and derived. Primary horizontal partitioning of a relation is performed using attributes defined on that relation. This fragmentation may reduce the query processing cost of selections. Derived horizontal partitioning, on the other hand, is the fragmentation of a relation using attribute(s) defined on other relation(s). In other words, the derived horizontal partitioning of a table is based on the fragmentation schema of other table(s). The derived partitioning of a table R based on the fragmentation schema of S is feasible if and only if there is a join link between R and S (R contains a foreign key of S).

A. Methodology to Partition the Relational Data Warehouse

In the context of relational data warehouses, we proposed in [4] a methodology to partition the different tables of a star schema (dimension and fact): partition some or all dimension tables using primary horizontal partitioning (this partitioning may be virtual) and then partition the fact table using the fragmentation schemes of the fragmented dimension tables. This methodology takes into consideration the requirements of star join queries, but it may generate a large number of horizontal fragments of the fact table, denoted by $N$:

$N = \prod_{i=1}^{g} m_i$,

where $m_i$ and $g$ are the number of fragments of the dimension table $D_i$ and the number of dimension tables participating in the fragmentation process, respectively.

For example, suppose that the Customer dimension table is partitioned into 50 fragments using the State attribute⁴, Time into 36 fragments using the Month attribute, and Product into 80 fragments using the Package type attribute. The fact table will then be fragmented into 144 000 fragments (50 × 36 × 80). In order to control the number of generated fragments of the fact table, we offer the DBA the possibility to set this number in order to facilitate the data allocation process.
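The fragment count in this example can be checked with a short sketch. The function name and the dictionary layout are illustrative assumptions; only the three fragment counts come from the example.

```python
from math import prod

def fact_fragment_count(dimension_fragments):
    """Return N, the product of the fragment counts m_i of the fragmented dimension tables."""
    return prod(dimension_fragments.values())

# Fragment counts taken from the example in the text.
example = {"Customer": 50, "Time": 36, "Product": 80}
print(fact_fragment_count(example))  # 144000
```

Because N is a product, adding one more fragmented dimension multiplies the number of fact fragments, which is why a bound on N is useful.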

Consequently, we formalize the problem of selecting a horizontal partitioning schema as an optimization problem: given a data warehouse schema $\{D_1, ..., D_d, F\}$ to be fragmented based on a representative workload $Q = \{Q_1, Q_2, ..., Q_n\}$ executed on a shared-nothing machine with $M$ nodes $ND = \{N_1, N_2, ..., N_M\}$ and a constraint (maintenance bound $W$) representing the number of sub star schemes (fragments)

³ We use the words fragmentation and partitioning interchangeably.

⁴ Case of the 50 states in the U.S.A.


that the DBA considers relevant for the allocation process. The horizontal partitioning selection problem consists in fragmenting the fact table $F$ into $N$ fragments based on the fragmentation schemes of dimension tables, such that $\sum_{Q_j \in Q} f_{Q_j} \times Cost(Q_j)$ is minimized and $N \leq W$, where $f_{Q_j}$ and $Cost(Q_j)$ represent the access frequency of the query $Q_j$ and the cost of evaluating $Q_j$ on the parallel machine, respectively.

B. Horizontal Partitioning Selection Process

Note that every fragmentation algorithm needs application information defined on the tables that have to be partitioned. The information is divided into two categories [3]: quantitative and qualitative. Quantitative information gives the selectivity factors of selection predicates and the frequencies of the queries accessing these tables ($Q = \{Q_1, ..., Q_n\}$). Qualitative information gives the selection predicates defined on dimension tables. Before performing fragmentation, the following tasks should be done [4]: (1) extraction of all simple predicates defined on dimension tables and used by the $n$ queries; (2) assignment to each dimension table $D_i$ ($1 \leq i \leq d$) of its set of simple predicates ($SSPD_i$); (3) exclusion of each dimension table $D_i$ having $SSPD_i = \emptyset$, which cannot participate in the partitioning process. Let $D_{candidate}$ be the set of dimension tables having a non-empty $SSPD_i$, and let $g$ be the cardinality of $D_{candidate}$ ($g \leq d$); and (4) application of the COM_MIN algorithm [3] to each dimension table $D_i$ of $D_{candidate}$. Completeness and minimality state that a relation is partitioned into at least two fragments which are accessed differently by at least one application [3]. This algorithm takes a set of simple predicates and generates a set of complete and minimal predicates.

The fragmentation of each dimension table $D_j$ in $D_{candidate}$ is based on partitioning the domain of each selection attribute of $D_j$. To illustrate this domain partitioning, suppose that the domains of the attributes Age and Gender of the dimension table CUSTOMER and Season of the dimension table TIME are:

$Dom(Age) = ]0, 120]$, $Dom(Gender) = \{'M', 'F'\}$, and $Dom(Season) = \{"Summer", "Spring", "Autumn", "Winter"\}$.

We assume that the DBA splits the domains of these attributes into sub domains as follows:

$Dom(Age) = d_{11} \cup d_{12} \cup d_{13}$, with $d_{11} = ]0, 18]$, $d_{12} = ]18, 60[$, $d_{13} = [60, 120]$.
$Dom(Gender) = d_{21} \cup d_{22}$, with $d_{21} = \{'M'\}$, $d_{22} = \{'F'\}$.
$Dom(Season) = d_{31} \cup d_{32} \cup d_{33} \cup d_{34}$, where $d_{31} = \{"Summer"\}$, $d_{32} = \{"Spring"\}$, $d_{33} = \{"Autumn"\}$, and $d_{34} = \{"Winter"\}$.

The sub domains of all three fragmentation attributes are represented in Figure 2.


Fig. 2. An Example of Sub domains
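As an illustration of the domain split above, the following sketch maps an attribute value to the index of its sub domain. The function name and the list-based lookup are assumptions made for this example; the interval bounds are those given in the text.

```python
SEASONS = ["Summer", "Spring", "Autumn", "Winter"]  # d31..d34
GENDERS = ["M", "F"]                                 # d21, d22

def sub_domain_index(attribute, value):
    """Return the 1-based sub domain index of `value` for the given attribute."""
    if attribute == "Age":
        if not 0 < value <= 120:
            raise ValueError("Age outside Dom(Age) = ]0, 120]")
        if value <= 18:
            return 1  # d11 = ]0, 18]
        if value < 60:
            return 2  # d12 = ]18, 60[
        return 3      # d13 = [60, 120]
    if attribute == "Gender":
        return GENDERS.index(value) + 1
    if attribute == "Season":
        return SEASONS.index(value) + 1
    raise KeyError(attribute)

print(sub_domain_index("Age", 18), sub_domain_index("Age", 60))  # 1 3
```

The open and closed bounds matter at the edges: an age of exactly 18 falls in the first sub domain, and an age of exactly 60 falls in the third.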

C. Coding Fragmentation Schema

Domain partitioning of the different fragmentation attributes may be represented by multidimensional arrays, where each array represents the domain partitioning of one fragmentation attribute. The value of each cell of a given array representing an attribute $A_i^{D_k}$ belongs to $[1..n_i]$, where $n_i$ represents the number of sub domains of the attribute $A_i^{D_k}$. Based on this representation, the fragmentation schema of each table is generated as follows: (1) If all cells of a given attribute have different values, all sub domains will be considered in the partitioning of the corresponding dimension table. (2) If all cells of a given attribute have the same value, the attribute will not participate in the fragmentation process. (3) If some cells of a given attribute have the same value, their corresponding sub domains will be merged into one. The sub domain obtained is called a merged sub domain. Table I gives an example of coding of a fragmentation schema based on three attributes: Gender, Season and Age.

TABLE I
AN EXAMPLE OF CODING OF PARTITIONING SCHEMA

Gender   1  2
Season   1  2  3  3
Age      1  1  2

Since CUSTOMER and TIME have been partitioned into 4 and 3 fragments, respectively (2 Gender sub domains × 2 Age sub domains, and 3 Season sub domains), the fact table is partitioned into 12 fragments.
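The fragment counts of this example can be derived mechanically from the coding. The sketch below is illustrative: the dictionary layout and the function name are assumptions, while the codes and the attribute-to-table mapping are those of the example.

```python
from math import prod

def fragments_per_table(coding, attribute_table):
    """Count fragments per dimension table: the product, over its attributes,
    of the number of distinct codes (distinct codes = sub domains kept apart)."""
    counts = {}
    for attribute, cells in coding.items():
        table = attribute_table[attribute]
        counts[table] = counts.get(table, 1) * len(set(cells))
    return counts

# Coding from Table I; Gender and Age belong to CUSTOMER, Season to TIME.
coding = {"Gender": [1, 2], "Season": [1, 2, 3, 3], "Age": [1, 1, 2]}
attribute_table = {"Gender": "CUSTOMER", "Age": "CUSTOMER", "Season": "TIME"}

per_table = fragments_per_table(coding, attribute_table)
print(per_table)                 # {'CUSTOMER': 4, 'TIME': 3}
print(prod(per_table.values()))  # 12
```

An attribute whose cells all carry the same code contributes a factor of 1, which matches rule (2): it does not participate in the fragmentation.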

The above coding is used by our genetic algorithm to represent any solution (i.e., a fragmentation schema). It may suffer from multi-instantiation, where one fragmentation schema may be represented by multiple codings. This problem can be solved using Restricted Growth Functions [9].
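To make the multi-instantiation issue concrete: the arrays [1, 1, 2], [2, 2, 1] and [3, 3, 1] all merge the first two sub domains and keep the third apart. A minimal sketch of a restricted-growth canonical form, assuming codes are relabelled in order of first appearance (the function name is hypothetical and this is not necessarily the exact encoding used in [9]):

```python
def to_restricted_growth(cells):
    """Relabel codes in order of first appearance so that equivalent
    codings of one attribute map to the same canonical array."""
    relabel = {}
    canonical = []
    for code in cells:
        if code not in relabel:
            relabel[code] = len(relabel) + 1  # next unused label, starting at 1
        canonical.append(relabel[code])
    return canonical

print(to_restricted_growth([3, 3, 1]))  # [1, 1, 2]
print(to_restricted_growth([2, 2, 1]))  # [1, 1, 2]
```

In the canonical array, the first cell is always 1 and each later cell exceeds the maximum of the preceding cells by at most 1, which is the restricted growth property.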

D. Our Fragmentation Algorithm

To select a horizontal partitioning schema, we extend the genetic algorithm proposed for a centralized data warehouse environment [?]. This extension concerns two points: (i) the use of Restricted Growth Functions and (ii) the modification of the fitness function. For each generated fragmentation solution, our algorithm verifies whether it is feasible for the data allocation process (this verification is done by the fitness function). If so, it keeps this solution and then applies genetic operators such as crossover and mutation; otherwise, it considers another solution. An outline of this algorithm is as follows:

Generate initial population;
Perform selection step;
iteration_number := 0;
while iteration_number < Max_Iteration do
    evaluation of generated solution according to allocation;
    Perform crossover step;
    Perform mutation step;
    iteration_number := iteration_number + 1;
end while.
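The outline can be sketched as a generic loop. This is a minimal sketch under stated assumptions, not the implementation used in the paper: truncation selection, the operator signatures, the function name and the requirement of at least two individuals are all choices made for this illustration. The fitness callable is expected to return a cost (lower is better) and may return infinity for a schema that is not feasible for allocation.

```python
import random

def genetic_search(init_population, fitness, crossover, mutate,
                   max_iteration=50, crossover_rate=0.8, mutation_rate=0.2,
                   rng=None):
    """Generic GA loop: evaluate, select, cross over, mutate; return the best solution seen.
    Assumes the initial population holds at least two individuals."""
    rng = rng or random.Random(0)
    population = list(init_population)
    best = min(population, key=fitness)
    for _ in range(max_iteration):
        # Evaluation and selection: keep the better half as parents (assumed strategy).
        population.sort(key=fitness)
        parents = population[:max(2, len(population) // 2)]
        children = []
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng) if rng.random() < crossover_rate else list(a)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        population = children
        best = min(population + [best], key=fitness)
    return best
```

Since the best solution seen so far is carried across generations, the returned cost is never worse than the best cost of the initial population.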


The initial population may be generated randomly. In our study, we did not advocate this generation, where sub domains are merged until the maintenance constraint is satisfied. Mutation, crossover and selection are similar to those proposed in [4]. Since our fragmentation algorithm is combined with the allocation process, the fitness function must take this combination into account. In the next section, we present our allocation process.

E. Allocation Process

Contrary to most existing data allocation solutions, where the allocation unit is a fragment, the allocation unit in our approach is a sub star schema. The allocation problem may be formulated as follows: given a set of sub star schemes $S = \{S_1, ..., S_N\}$ and a set of queries $Q = \{Q_1, ..., Q_n\}$ executing on a set of nodes of a shared-nothing machine, where each query has an access frequency, the sub star schemes allocation problem consists in assigning these schemes to the various nodes such that the processing cost of all queries is minimized.

For this study, we assume that dimension tables are replicated over all nodes and reside in main memory. The structures required for our allocation process are:

1) Sub star schema usage matrix ($SSUM$): it indicates the usage of sub star schemes by the set of queries. This matrix contains queries as rows and sub star schemes as columns. The value $SSUM_{ij}$ ($1 \leq i \leq n$, $1 \leq j \leq N$) is equal to 1 if the query $Q_i$ uses the sub star schema $S_j$, and 0 otherwise.

2) Allocation matrix $FAM$: it represents the allocation schema of sub star schemes over nodes. Each value of this matrix is defined as follows: $fam_{ij} = 1$ if the sub star schema $S_i$ is allocated at node $N_j$, and 0 otherwise. Since we are considering a non-redundant allocation, each sub star schema is stored at exactly one node: $\sum_{j=1}^{M} fam_{ij} = 1$ for every $S_i$.
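The two structures can be sketched as plain nested lists. The function names and the 0-based indexing are assumptions for this illustration. The non-redundancy check reads the constraint as "each sub star schema is placed on exactly one node", which is our interpretation.

```python
def build_ssum(query_usage, n_schemas):
    """Build SSUM: one row per query, one column per sub star schema.
    query_usage[i] is the set of 0-based schema indices used by query i."""
    return [[1 if j in used else 0 for j in range(n_schemas)]
            for used in query_usage]

def is_non_redundant(fam):
    """True if every sub star schema (row of FAM) is allocated to exactly one node."""
    return all(sum(row) == 1 for row in fam)

ssum = build_ssum([{0, 2}, {1}], n_schemas=3)
print(ssum)                                        # [[1, 0, 1], [0, 1, 0]]
print(is_non_redundant([[1, 0], [0, 1], [1, 0]]))  # True
```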

We now have all the ingredients to present our allocation algorithm. For each generated fragmentation schema, the sub star schema usage matrix $SSUM$ is generated. Based on $SSUM$, an affinity matrix between pairs of sub star schemes is constructed as follows: the rows and columns of this matrix represent the sub star schemes generated by the fragmentation algorithm. Each value of this matrix represents the sum of the access frequencies of the queries accessing the two sub star schemes simultaneously (it is a symmetric matrix). In order to generate groups of sub star schemes, we adapt the algorithm of Navathe et al., which vertically partitions relational tables using a graphical approach [10]. This adaptation concerns the manner of choosing cycles: sub star schemes are grouped based on their low affinities, contrary to Navathe et al., where attributes with high affinities form a cycle. This grouping increases parallelism between nodes. The algorithm gives us a set of cycles $C = \{C_1, ..., C_H\}$, where each one represents a subset of sub star schemes.
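The affinity matrix described above can be sketched as follows. The function name is hypothetical, and leaving the diagonal at 0 is an assumption, since the text only defines affinities between pairs of sub star schemes. The cycle-forming step adapted from [10] is not reproduced here.

```python
def affinity_matrix(ssum, frequencies):
    """aff[a][b] = sum of the frequencies of the queries that use both
    sub star schemas a and b (a != b); the diagonal is left at 0."""
    n_schemas = len(ssum[0]) if ssum else 0
    aff = [[0] * n_schemas for _ in range(n_schemas)]
    for row, freq in zip(ssum, frequencies):
        used = [j for j, flag in enumerate(row) if flag]
        for a in used:
            for b in used:
                if a != b:
                    aff[a][b] += freq
    return aff

# Three queries over three sub star schemas, with access frequencies 10, 5 and 2.
ssum = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
print(affinity_matrix(ssum, [10, 5, 2]))  # [[0, 12, 2], [12, 0, 7], [2, 7, 0]]
```

In this toy input, schemas 0 and 2 have the lowest affinity (2), so under the low-affinity grouping described above they would be natural candidates for the same cycle.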

To allocate sub star schemes over nodes, we use a round robin strategy where cycles, rather than individual fragments (as in most existing works), are allocated. Once this allocation is established, we set the fragment allocation matrix (FAM) as follows:

Inputs: C = {C1, ..Ck} set of cycles, M
indice_node ← 0
for each class Ci of C do
    for each element e of Ci do
        if e ∈ Ci then
            FAM[e][indice_node] ← 1
        else
            FAM[e][indice_node] ← 0
        end if
    end for
    indice_node ← indice_node + 1
    if indice_node = M then
        indice_node ← 0
    end if
end for
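A runnable sketch of this round robin step is given below. It assumes that FAM is initialised to zeros beforehand and that sub star schemas are identified by 0-based integers; the function name is hypothetical.

```python
def allocate_cycles_round_robin(cycles, n_schemas, n_nodes):
    """Assign each cycle (a list of schema indices) to the next node in turn
    and return FAM with fam[e][node] = 1 when schema e is stored at that node."""
    fam = [[0] * n_nodes for _ in range(n_schemas)]
    node = 0
    for cycle in cycles:
        for e in cycle:
            fam[e][node] = 1
        node = (node + 1) % n_nodes  # wrap around after the last node
    return fam

# Three cycles over five sub star schemas, allocated on two nodes.
print(allocate_cycles_round_robin([[0, 3], [1], [2, 4]], n_schemas=5, n_nodes=2))
# [[1, 0], [0, 1], [1, 0], [1, 0], [1, 0]]
```

When the cycles are disjoint, every row of the result contains exactly one 1, so the allocation is non-redundant.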

This allocation schema is evaluated by the fitness function of the genetic algorithm as follows:

$\sum_{k=1}^{n} \sum_{j=1}^{M} \sum_{i=1}^{N} SSUM_{ki} \times FAM_{ij} \times |F_i|$,

where $n$, $M$, $N$ and $|F_i|$ represent the number of queries, the number of nodes, the number of sub star schemes generated by the genetic algorithm, and the number of pages required for storing the fact fragment $F_i$, respectively.

Finally, the genetic algorithm should minimize the following function:

$\text{minimize}\left(\sum_{k=1}^{n} \max \sum_{j=1}^{M} \sum_{i=1}^{N} SSUM_{ki} \times FAM_{ij} \times |F_i|\right)$
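Both cost expressions can be evaluated directly on small matrices. In the sketch below, the function names are hypothetical, and the second function assumes that the maximum is taken over the nodes, i.e. each query costs as much as its most loaded node, since nodes work in parallel. This reading of the formula is our interpretation.

```python
def total_io_cost(ssum, fam, pages):
    """Triple sum over queries k, nodes j and schemas i of SSUM[k][i] * FAM[i][j] * |F_i|."""
    n_nodes = len(fam[0])
    return sum(ssum[k][i] * fam[i][j] * pages[i]
               for k in range(len(ssum))
               for j in range(n_nodes)
               for i in range(len(fam)))

def bottleneck_cost(ssum, fam, pages):
    """Per query, take the most loaded node (assumed meaning of max), then sum over queries."""
    n_nodes = len(fam[0])
    total = 0
    for row in ssum:
        per_node = [sum(row[i] * fam[i][j] * pages[i] for i in range(len(fam)))
                    for j in range(n_nodes)]
        total += max(per_node)
    return total

ssum = [[1, 1, 0], [0, 1, 1]]   # 2 queries, 3 sub star schemas
fam = [[1, 0], [0, 1], [1, 0]]  # 3 sub star schemas, 2 nodes
pages = [100, 50, 30]           # |F_i| in pages (made-up values)
print(total_io_cost(ssum, fam, pages))    # 230
print(bottleneck_cost(ssum, fam, pages))  # 150
```

The gap between the two values reflects parallelism: the first query reads 150 pages in total, but only 100 on its busiest node.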

IV. EXPERIMENTAL STUDIES

In this section, we show the results of our experiments on the APB-1 benchmark [11]. The star schema of this benchmark has one fact table Actvars (24 786 000 tuples, width = 74) and four dimension tables: Prodlevel (9 000 tuples, width = 72), Custlevel (900 tuples, width = 24), Timelevel (24 tuples, width = 36), and Chanlevel (9 tuples, width = 24). This warehouse has been populated using the generation module of APB-1. Our simulation software was built in Java and run on a Pentium IV 1.5 GHz microcomputer (with 384 MB of memory). We have considered 55 queries. Each query has selection predicates, each with its own selectivity factor. The crossover and mutation rates used in our experiments are 80% and 20%, respectively. The threshold representing the number of fragments that the DBA considers relevant for the data allocation process is 100 (for the first three experiments).

Fig. 3. Combined Approach vs. Iterative Approach

In Figure 3, we have compared the performance of the iterative and combined approaches by varying the number of nodes of the parallel machine. For each variation the number of


IOs is computed. We observed that the combined approach outperforms the iterative one, whose cost grows exponentially as the number of nodes increases, while the combined approach shows a stable behaviour. In Figure 4, we have compared our approach

Fig. 4. Combined Approach vs. Round Robin

with round robin placement, which is the most used in the literature, by varying the number of nodes and computing the cost of each allocation. For the round robin approach, we use the same principle of fragmentation, but instead of allocating cycles, we allocate fragments in a round robin fashion. This experiment shows that our approach outperforms round robin for a small number of nodes, but that the two are quite similar for a large number of nodes, with a small advantage for our approach.

Figure 5 studies the speed up of our approach. It shows that this factor is not linear, since our approach does not consider the load balancing problem.

In the last experiment, we consider a parallel machine with 10 nodes and try to identify the number of fact fragments that ensures high performance. To do so, we vary this number from 50 to 400 and, for each value, we run our combined algorithm and compute the cost of the generated solution. Figure 6 shows that fragmenting the warehouse into 200 sub star schemes is well adapted to a machine with 10 nodes.

Fig. 5. Speed up of our approach

V. CONCLUSION

In this paper, we have proposed an approach for simultaneously fragmenting a relational data warehouse modelled using a star schema and allocating the generated fragments to the various nodes of a shared-nothing parallel database machine. Most of the existing works on parallel database design considered the data fragmentation and partition allocation problems in isolation and did not exploit the interdependencies between

Fig. 6. Multi-instantiation of the coding

these two problems. During the fragmentation process, our approach decides whether the generated partitioning schema is relevant for the data allocation process. We used a genetic algorithm for the fragmentation process. For the allocation process, we use a variant of the round robin approach where, instead of allocating a single fragment at a node, we allocate a subset of cycles representing sub star schemes. Our approach is evaluated using the data set of the APB-1 benchmark and the preliminary results are promising. Additional and large-scale experiments are underway to check our findings.

We plan to extend this work in two directions: (i) taking into account load balancing during the allocation process and (ii) allocating optimization techniques such as materialized views and indexes over nodes.

REFERENCES

[1] P. Furtado, “Experimental evidence on partitioning in parallel data warehouses,” in DOLAP, 2004, pp. 23–30.

[2] T. Stohr, H. Martens, and E. Rahm, “Multi-dimensional database allocation for parallel data warehouses,” Proceedings of the International Conference on Very Large Databases, pp. 273–284, 2000.

[3] M. T. Ozsu and P. Valduriez, Principles of Distributed Database Systems: Second Edition. Prentice Hall, 1999.

[4] L. Bellatreche, K. Boukhalfa, and H. I. Abdalla, “Saga: A combination of genetic and simulated annealing algorithms for physical data warehouse design,” in 23rd British National Conference on Databases, no. 212-219, July 2006.

[5] S. Navathe, K. Karlapalem, and M. Ra, “A mixed partitioning methodology for distributed database design,” Journal of Computer and Software Engineering, vol. 3, no. 4, pp. 395–426, 1995.

[6] K. Karlapalem and N. M. Pun, “Query driven data allocation algorithms for distributed database systems,” in 8th International Conference on Database and Expert Systems Applications (DEXA’97), Toulouse, Lecture Notes in Computer Science 1308, pp. 347–356, September 1997.

[7] T. Stohr and E. Rahm, “Warlock: A data allocation tool for parallel warehouses,” Proceedings of the International Conference on Very Large Databases, pp. 721–722, 2001.

[8] A. Sanjay, V. R. Narasayya, and B. Yang, “Integrating vertical and horizontal partitioning into automated physical database design,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 359–370, June 2004.

[9] A. Tucker, J. Crampton, and S. Swift, “Rgfga: An efficient representation and crossover for grouping genetic algorithms,” Evol. Comput., vol. 13, no. 4, pp. 477–499, 2005.

[10] S. Navathe and M. Ra, “Vertical partitioning for database design: a graphical algorithm,” ACM SIGMOD, pp. 440–450, 1989.

[11] O. Council, “Apb-1 olap benchmark, release ii,” http://www.olapcouncil.org/research/resrchly.htm, 1998.
