[ieee 2010 international conference on multimedia information networking and security - nanjing,...

A Dynamic Data Storage Architecture for SaaS

Wu Shengqi, Zhang Shidong, Kong Lanju

School of Computer Science and Technology, Shandong University, Jinan, China

Abstract-In the implementation of Software as a Service (SaaS), Universal Table and Column Store have become the two most typical data storage architectures. However, both have obvious drawbacks. Microsoft proposed a schema based on a Basic-Table combined with an Extension-Table (BT&ET), in which some of the tenants’ common fields are stored in the basic table to improve processing efficiency. However, the structure of the basic table is fixed, because it is defined by the service provider at the development stage. Thus none of the extension fields can be stored in the basic table; even if they are accessed much more frequently than the common columns, they still require tuple reconstruction. In this paper we improve the BT&ET schema and, based on the improved schema, propose a dynamic self-adaptive algorithm. Driven by the tenants’ changing data access needs, the algorithm stores a tenant’s frequently accessed extension fields in the basic table.

Keywords-SaaS; Dynamic Data Storage Architecture; BT&ET; Dynamic Self-adaptive Algorithm.

I. INTRODUCTION

A well-designed SaaS application should be extensible, configurable and efficient for multiple tenants. Compared to traditional services, SaaS has many unique features, and multi-tenancy is one of its key characteristics. As the number of tenants and the number of each tenant’s extension fields increase, multi-tenant data storage has become a challenge in SaaS implementation.

In the Microsoft white paper [1], Frederick Chong et al. propose a schema based on a Basic-Table combined with an Extension-Table to support multi-tenant data storage. Tenants’ common fields are stored in the Basic-Table, while the extension fields are stored in the Extension-Table. The tag and data type of each extension field are stored in the metadata table.

Data in the Basic-Table can be accessed in the usual way with high efficiency. Accessing data in the Extension-Table, however, involves tuple reconstruction, so the processing efficiency is low. Worse still, if tenants access the extension data frequently but seldom access the Basic-Table, the advantage of the Basic-Table disappears.

To make good use of the Basic-Table, we would like to store some of the tenants’ frequently accessed extension fields in the Basic-Table. These data can then be accessed without tuple reconstruction, and the efficiency will be higher. However, the schema of the Basic-Table is defined by the SaaS provider at the development stage and contains only the few tenant common fields that the SaaS provider knows in advance. It is therefore hard to store the tenants’ varied extension fields in the Basic-Table.

In this article, we improve the Basic-Table combined with Extension-Table data storage architecture so that some tenants’ extension data can be stored in the Basic-Table. We create the Basic-Table with some reserved fields, but initially store the tenants’ extension data in the Extension-Table. We then use our dynamic self-adaptive algorithm to transfer the tenants’ extension data that satisfy our criterion into the Basic-Table. These extension data can then be accessed in the basic table, which improves query performance.

The article is structured as follows. Section II discusses related work. Section III outlines the improved BT&ET data storage architecture. Section IV describes the dynamic self-adaptive algorithm. Section V presents the results of our experiments, followed by conclusions and future work in Section VI.

II. RELATED WORK

A multi-tenant database system for SaaS should offer schemas in which the base schema and its extensions can evolve dynamically while the database is online [3]. Three solutions for building a multi-tenant database are described in [4]: the independent database, the shared database with independent schemas, and the shared database with a shared schema, as illustrated in Figure 1.

Figure 1. Multi-tenant database system

The most basic approach is the Private Table Layout. In this approach, tenants have their own private instances of the base tables, which are extended as required. The database holds no metadata, and the query-transformation layer only needs to replace table names. Although this layout offers good data isolation and security, it is only appropriate for larger services with a small number of tenants.

Extension Table Layout builds on Private Table Layout and integrates the tenants’ common data into a basic table. Each tenant’s extensions are vertically partitioned into separate extension tables that are joined to the base tables via a row ID column. The base table needs a tenant column to distinguish one tenant’s rows from another’s. However, the number of tables grows with the number of tenants, since more tenants bring a wider variety of basic requirements.

2010 International Conference on Multimedia Information Networking and Security

978-0-7695-4258-4/10 $26.00 © 2010 IEEE

DOI 10.1109/MINES.2010.71


The most mature and popular solution is the shared database with a shared schema, which creates the application schema only once and maps all tenants directly onto this schema using one of the available schema-mapping techniques. Based on relational databases, a variety of mapping techniques have been used to solve multi-tenancy problems in the shared mode.

Universal Table is a generic structure with a Tenant column, a Table column, and a large number of generic data columns. The data columns have a flexible type, such as VARCHAR, into which other types can be converted [2, 9]. It supports arbitrary tenant extensions and does not need many aligning joins for reconstruction. However, indexing on a universal table is not effective, and the table contains a large number of null values, which has a negative impact on performance. To resolve these problems, Mei Hui et al. [4] propose a multi-tenant database system called M-Store, which uses two techniques: Bitmap Interpreted Tuple (BIT), which addresses the null value problem, and Multi-Separated Index (MSI), which improves performance by creating indexes on each tenant’s own data for frequently accessed attributes. However, it does not address extensibility.

XML support has been implemented in several commercial database systems, such as IBM’s pureXML [5]. The base table has a column that stores a flat XML document containing all extension fields. The XML document is always loosely typed, to accommodate tenants’ various custom information. As a result, performance decreases in proportion to the number of extension fields, because the loosely typed documents must be parsed and reassembled into typed rows.

In Pivot Tables, each value is stored along with an identifier for its column in a tall narrow table [6], as in Google BigTable [7]. This layout does not need to handle many null values and supports more flexible extensions and indexes. However, it has more columns of metadata than actual data, and it incurs the overhead of on-the-fly tuple reconstruction for multi-attribute queries, which requires (n-1) aligning joins to build an n-column logical table [2, 10].

Chunk Folding and Chunk Tables are described in [2]. With Chunk Folding, logical tables are vertically partitioned into chunks that are folded together into different physical multi-tenant tables and joined as needed. A Chunk Table is like a Pivot Table, but it reduces the proportion of metadata and the overhead of tuple reconstruction by using a set of data columns of various types together with a Chunk column.

Microsoft uses the basic table combined with extension table mapping technique in [1, 2]; we discuss it in detail in the following section. In [3], Stefan Aulbach et al. conclude that the ideal database system for SaaS has not yet been developed, and offer some suggestions as to how it should be designed.

III. THE IMPROVED BT&ET DATA STORAGE LAYOUT

A. The Outline of the Original BT&ET Layout

In the Microsoft white paper [1], Frederick Chong et al. outline the basic table combined with extension table schema-mapping technique for implementing multi-tenancy. The basic table stores the tenants’ common data, and the extension table stores name-value pairs for the extension fields. The extension data are joined to the base tables via a row ID column. Compared with the universal table layout, this layout is more extensible, because in a universal table the number of extension columns cannot exceed the number of reserved columns. In contrast with a column store, it provides better performance through its basic table.

Figure 2 shows an example from our practical application.

Figure 2. Example of BT&ET layout

As illustrated in Figure 2, tenant35 extends the table whose id is 1 with two columns named Salary and Age. In the Basic Table, every row of tenant35’s data in table 1 has a global id, by which we can find the corresponding extension data in the Extension Table.

If, in the course of its business, the tenant accesses the Salary and Age fields very frequently and seldom accesses the data in the basic table, the advantage of the basic table is lost. In our data storage architecture, we store the frequently accessed Salary and Age data in the basic table to improve performance.
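The cost of tuple reconstruction in the original layout can be made concrete with a small in-memory SQLite sketch. The table names, column names and sample rows below are illustrative assumptions, not the schema or data shown in Figure 2; the point is only that each requested extension field adds one join against the name-value extension table.

```python
import sqlite3

# Hypothetical miniature of the BT&ET layout; names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE basic_table (
        global_id INTEGER PRIMARY KEY,
        tenant_id INTEGER,
        table_id  INTEGER,
        name      TEXT               -- a common field (assumed)
    );
    CREATE TABLE extension_table (
        global_id INTEGER,           -- joins back to basic_table
        field     TEXT,              -- extension field name
        value     TEXT               -- stored as text, cast on read
    );
    INSERT INTO basic_table VALUES (1, 35, 1, 'Alice'), (2, 35, 1, 'Bob');
    INSERT INTO extension_table VALUES
        (1, 'Salary', '5000'), (1, 'Age', '30'),
        (2, 'Salary', '6200'), (2, 'Age', '41');
""")


def reconstruct(conn, tenant_id, table_id, fields):
    """Rebuild logical rows; every extension field costs one extra join."""
    select = ["b.global_id", "b.name"]
    joins = []
    for i, _ in enumerate(fields):
        alias = f"e{i}"
        select.append(f"{alias}.value")
        joins.append(
            f"LEFT JOIN extension_table {alias} "
            f"ON {alias}.global_id = b.global_id AND {alias}.field = ?"
        )
    sql = (
        f"SELECT {', '.join(select)} FROM basic_table b {' '.join(joins)} "
        "WHERE b.tenant_id = ? AND b.table_id = ? ORDER BY b.global_id"
    )
    # Placeholders appear in the SQL as: join fields first, then the WHERE values.
    return conn.execute(sql, [*fields, tenant_id, table_id]).fetchall()


rows = reconstruct(conn, 35, 1, ["Salary", "Age"])
```

A query that touched only `name` would need no join at all, which is the advantage the basic table loses when tenants mostly read extension fields.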

B. The Improvement on the BT&ET Layout

To make good use of the Basic-Table, we would like to store some of the tenants’ frequently accessed extension fields in the Basic-Table, so that they can be accessed without tuple reconstruction and with higher efficiency. On the basis of the BT&ET layout, we make the following changes.

1. We add three columns to the ColumnMetaData table:

• Count: Recording the number of times this column has been accessed.

• IsInBasic: Recording whether this column has been transferred into the Basic-Table.

• BColumnName: Recording the column name of the Basic-Table if this column has been transferred into the Basic-Table.

We use the first column to record the access history of the extension fields and to determine whether a column should be transferred into the basic table. The latter two columns record the details of the transferred columns.

2. In the Basic-Table, in addition to the basic common columns, we reserve some columns that will be used to store extension data. As a result the database has to handle many null values, but commercial relational databases handle nulls fairly efficiently. The reserved columns have a flexible data type, such as VARCHAR, into which other types can be converted. In our design we set the number of reserved columns to 60; finding a suitable method to generate the proper value is left to future work. The database structure after our modification is illustrated in Figure 3.

Figure 3. The database structure

As shown in Figure 3, the database structure is divided into three parts: the Metadata Area, the Data Tables Area and the Index Tables Area. The Metadata Area maintains information about the extensible tables, the tenants’ extension columns and some other metadata. The reserved columns that each tenant has used are recorded in the TableMetaData table. The Data Tables Area stores the business data in the basic table and the extension table. The Index Tables Area stores the indexes of the database.
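As a rough illustration of the metadata changes above, one ColumnMetaData row could be modeled as follows. Only Count, IsInBasic, BColumnName and the figure of 60 reserved columns come from the text; the other field names, the `R1`..`R60` naming of reserved columns and the helper methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# 60 reserved columns as in the design above; the R1..R60 names are an assumption.
RESERVED_COLUMNS = [f"R{i}" for i in range(1, 61)]


@dataclass
class ColumnMetaData:
    tenant_id: int
    table_id: int
    column_name: str                       # logical name the tenant sees
    data_type: str
    count: int = 0                         # Count: number of accesses so far
    is_in_basic: bool = False              # IsInBasic
    b_column_name: Optional[str] = None    # BColumnName: reserved column in use

    def record_access(self) -> None:
        self.count += 1

    def mark_transferred(self, reserved_column: str) -> None:
        if reserved_column not in RESERVED_COLUMNS:
            raise ValueError(f"unknown reserved column: {reserved_column}")
        self.is_in_basic = True
        self.b_column_name = reserved_column

    def location(self) -> Tuple[str, Optional[str]]:
        """Where a query layer would look for this field's values."""
        if self.is_in_basic:
            return ("basic_table", self.b_column_name)
        return ("extension_table", self.column_name)


salary = ColumnMetaData(tenant_id=35, table_id=1, column_name="Salary", data_type="DECIMAL")
salary.record_access()
salary.mark_transferred("R1")
```

With such a record, a query-transformation layer could call `location()` to decide whether a field needs tuple reconstruction or can be read directly from a reserved column.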

IV. IMPLEMENTATION OF DYNAMIC SELF-ADAPTIVE ALGORITHM

In this paper, we want to transfer some of the tenants’ frequently accessed extension fields into the Basic-Table to improve performance. Based on our improved database structure, we propose the dynamic self-adaptive algorithm. In this section, we first describe the formula and some details of the algorithm, and then outline the algorithm itself.

A. Notes and Formula for the Algorithm

In our system, we store the metadata of each extension field in the ColumnMetaData table, initializing Count to 0, IsInBasic to NO, and BColumnName to NULL. The tenant’s extension business data are still stored in the extension table.

We set a clock point in our platform, for example 0:00 every Sunday. When that time arrives, a new thread is created to run the algorithm, and no user is aware of it.
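One way to realize such a weekly clock point is sketched below, assuming a Sunday 0:00 trigger and Python's standard `threading.Timer`. The function names are hypothetical and the paper does not describe its scheduling code.

```python
import threading
from datetime import datetime, timedelta


def next_clock_point(now: datetime, weekday: int = 6, hour: int = 0) -> datetime:
    """Return the first weekly clock point strictly after `now` (weekday 6 = Sunday)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def schedule(job, now: datetime) -> threading.Timer:
    """Build (but do not start) a background timer that runs `job` at the clock point."""
    delay = (next_clock_point(now) - now).total_seconds()
    timer = threading.Timer(delay, job)
    timer.daemon = True   # do not keep the platform alive just for this timer
    return timer
```

Calling `schedule(run_algorithm, datetime.now()).start()` would run a hypothetical `run_algorithm` in its own thread, leaving request-handling threads untouched.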

In our algorithm, we need to transfer extension data from the extension table into the basic table, or from the basic table into the extension table, while users remain unaware of the transfer and continue to access the data normally. We therefore need to maintain the consistency of the data during the transfer.

Transferring data from the extension table into the basic table takes three steps. First, we mark the extension column with a flag indicating that it is being transferred into the basic table. Second, we copy the extension data from the extension table into the basic table. During the copy, if a user queries this extension column, we serve the data from the extension table; when an update occurs, we first update the data in the extension table and then update the data in the basic table. After the copy, and before removing the column from the extension table, we mark the extension column with another flag indicating that it has been transferred into the basic table completely, so that all operations are performed on the basic table. Third, we remove the extension data from the extension table and modify the metadata of this extension column. At that point the transfer of the extension column into the basic table is complete.

When extension data are transferred from the basic table into the extension table, we also need to maintain the consistency of the data. As above, we first mark the extension column with a flag indicating that it is being transferred from the basic table into the extension table. We then copy the extension data from the basic table into the extension table. During the copy, queries are served from the basic table; when an update occurs, we first update the data in the basic table and then update the data in the extension table. After the copy, we mark the extension column with another flag indicating that it has been transferred into the extension table completely, so that all operations are performed on the extension table. Finally, we remove this extension column from the basic table and modify the metadata of this extension column.
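The two-marker protocol described above amounts to a small state machine that routes reads and writes per column. The sketch below is a single-threaded simulation using dictionaries in place of the two tables; the state names, the `Column` class and the `on_row_copied` hook (which stands in for a concurrent user update) are all illustrative assumptions rather than the authors' implementation.

```python
from enum import Enum


class State(Enum):
    IN_EXTENSION = 1          # normal: data lives in the extension table
    COPYING_TO_BASIC = 2      # first marker of an extension -> basic transfer
    IN_BASIC = 3              # second marker: all operations use the basic table
    COPYING_TO_EXTENSION = 4  # first marker of a basic -> extension transfer


# state -> (store that serves reads, stores that receive writes, in order)
ROUTES = {
    State.IN_EXTENSION: ("extension", ("extension",)),
    State.COPYING_TO_BASIC: ("extension", ("extension", "basic")),
    State.IN_BASIC: ("basic", ("basic",)),
    State.COPYING_TO_EXTENSION: ("basic", ("basic", "extension")),
}


class Column:
    def __init__(self, values):
        self.state = State.IN_EXTENSION
        self.stores = {"extension": dict(values), "basic": {}}

    def read(self, row_id):
        return self.stores[ROUTES[self.state][0]][row_id]

    def write(self, row_id, value):
        for store in ROUTES[self.state][1]:
            self.stores[store][row_id] = value

    def _transfer(self, src, dst, copying, done, on_row_copied=None):
        self.state = copying                      # marker 1: transfer in progress
        for row_id in list(self.stores[src]):
            self.stores[dst][row_id] = self.stores[src][row_id]
            if on_row_copied:
                on_row_copied(row_id)             # simulated concurrent activity
        self.state = done                         # marker 2: transfer complete
        self.stores[src].clear()                  # finally drop the old copy

    def transfer_to_basic(self, on_row_copied=None):
        self._transfer("extension", "basic",
                       State.COPYING_TO_BASIC, State.IN_BASIC, on_row_copied)

    def transfer_to_extension(self, on_row_copied=None):
        self._transfer("basic", "extension",
                       State.COPYING_TO_EXTENSION, State.IN_EXTENSION, on_row_copied)


salary = Column({1: "5000", 2: "6200"})
# An update to row 2 arrives while the copy is still in progress.
salary.transfer_to_basic(lambda r: salary.write(2, "7000") if r == 1 else None)
```

Because writes during the copy go to the source first and then to the destination, the destination never ends up with a stale value regardless of whether the updated row had already been copied. A real system would additionally need locking or transactions, which this simulation omits.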

In the algorithm we need to compute the weight of every tenant’s extension field with the following Evaluation Formula:

Weight = a ∗ ( α / β ) + b ∗ γ. (1)

In the formula a, b are parameters and a + b = 1; α, β and γ are variables.

α: the number of times this extension field has been accessed.

β: the number of times the extensible table to which this extension field belongs has been accessed.

γ: the Service Level Agreement (SLA) level of the tenant. If a tenant has a higher service level, we need to do more to provide it with higher performance.

The parameters a and b are the proportions assigned to the two terms, respectively. Both are set by the SaaS provider.
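Formula (1) is simple enough to express directly. In the sketch below, the defaults a = 0.7 and b = 0.3 are the values the paper reports for its experiment; the function name, the validation, and the treatment of a never-accessed table (β = 0) as a ratio of 0 are assumptions made for illustration.

```python
def weight(field_count: int, table_count: int, sla: int,
           a: float = 0.7, b: float = 0.3) -> float:
    """Weight = a * (alpha / beta) + b * gamma, with a + b = 1."""
    if abs(a + b - 1.0) > 1e-9:
        raise ValueError("a and b must sum to 1")
    # Assumption: a table that has never been accessed contributes a ratio of 0.
    ratio = field_count / table_count if table_count else 0.0
    return a * ratio + b * sla


# Hypothetical field: accessed 78 times in a table accessed 156 times, SLA level 2.
w = weight(field_count=78, table_count=156, sla=2)   # 0.7 * 0.5 + 0.3 * 2, about 0.95
```

Note that the ratio term is at most 1 while the SLA term is not normalized, so with SLA levels of 1 to 3 the tenant's service level can outweigh the access frequency unless the provider scales a and b accordingly.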

B. Implementation of the Dynamic Self-adaptive Algorithm

In our dynamic self-adaptive algorithm, we first use the evaluation formula to compute the weight of every tenant’s extension columns. Then, for every tenant’s extensible table, we find the extension column (EColumn) that has the largest weight among those still stored in the extension table. If more than one column has the largest weight, we choose the first one we meet. Next we determine whether a free reserved column is available for this table. If so, we transfer this extension column into the Basic-Table directly. Otherwise, we find the extension column (SEColumn) that has already been transferred into the basic table and has the minimum weight in this extensible table; again, when more than one column has the same minimum weight, we choose the first one we meet. We then compare the weights of EColumn and SEColumn. If the weight of EColumn is larger, we first transfer the data of SEColumn into the extension table and then transfer the data of EColumn into the basic table. Otherwise we do nothing and continue with the next extensible table. The algorithm is shown in Figure 4.

Figure 4. The dynamic self-adaptive algorithm
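The decision made for one extensible table can be sketched as a pure function that returns the planned moves instead of performing them. The `ExtColumn` type, the action labels and the function name are hypothetical; the selection and tie-breaking rules follow the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtColumn:
    name: str
    weight: float
    in_basic: bool = False


def plan_transfers(columns: List[ExtColumn], free_reserved: int) -> List[Tuple[str, str]]:
    """Plan the moves for one extensible table as (action, column name) pairs."""
    candidates = [c for c in columns if not c.in_basic]
    if not candidates:
        return []
    # max() returns the first maximal element, matching "the first one we meet".
    e_column = max(candidates, key=lambda c: c.weight)
    if free_reserved > 0:
        return [("to_basic", e_column.name)]
    resident = [c for c in columns if c.in_basic]
    if not resident:
        return []
    # min() likewise returns the first minimal element on ties.
    se_column = min(resident, key=lambda c: c.weight)
    if e_column.weight > se_column.weight:
        return [("to_extension", se_column.name), ("to_basic", e_column.name)]
    return []


plan = plan_transfers(
    [ExtColumn("Salary", 0.95), ExtColumn("Dept", 0.70, in_basic=True)],
    free_reserved=0,
)
```

Here `plan` evicts the hypothetical `Dept` column before promoting `Salary`, mirroring the order in which the text performs the two transfers. At most one column per table is promoted per run, so a table with many hot extension fields converges over several clock points.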

In our experiment we set a = 70% and b = 30%, and the standard evaluation value is 1. We use 1, 2 and 3 to denote the level of each tenant’s SLA; that is, a tenant with the highest SLA level has γ = 3. The parameters before the dynamic self-adaptive algorithm runs for the first time are shown in Table I.

TABLE I. THE PARAMETERS

TenantID    Total Count    SLA Value
Tenant17    504            3
Tenant35    156            2
Tenant42    621            1

Figure 5 shows the changed metadata and Figure 6 the changed basic table after the algorithm has run. In both figures the changed data are marked with a box.

Figure 5. The changed ColumnMetaData table

Figure 6. The changed Basic-Table

V. EXPERIMENT

In this section, we empirically evaluate the performance of our dynamic self-adaptive algorithm and the efficiency of our data storage architecture through a set of experiments.

We first evaluate query performance and compare it with that of the original layout. In our experiments, we initialize the basic table with 10 common columns and an additional 60 reserved columns. Figure 7 shows that query performance improves once extension columns have been transferred into the basic table. The improvement becomes more pronounced as the number of extension columns and the size of the data set increase.

In our second experiment we test the performance of our dynamic self-adaptive algorithm. We initialize the basic table with 10 common columns and different numbers of reserved columns. The lines in Figure 8, from the bottom up, represent the time cost when 1000, 3000, 5000, 8000 and 10000 rows respectively are involved in the algorithm. The results show that the cost changes very little with the number of reserved columns, but the overhead grows as more rows are involved. This overhead is not a serious concern, because users can still access the data normally while the algorithm is running and are unaware of it.

Figure 7. Comparison of query performance.

Figure 8. Performance of dynamic self-adaptive algorithm

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we improved the Basic-Table combined with Extension-Table data storage architecture so that tenants’ frequently accessed extension data can be stored in the Basic-Table. These data can then be queried without tuple reconstruction, which improves performance. On the basis of our data layout we proposed the dynamic self-adaptive algorithm, which carries out the data transfer.

During the transfer we update data not only in basic table but also in extension table to maintain the consistency of the data.

In our data storage architecture we have to reserve some columns in the Basic-Table, and the number of reserved columns is currently set manually. In future work, we plan to establish a metadata statistical model that generates the proper number from historical business data.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grant No.90818001, the Natural Science Foundation of Shandong Province of China under Grant No.2009ZRB019YT; No.2009ZRB019RW, the Natural Science Foundation of Shandong Province of China under Grant No.Y2007G24, Key Technology R&D Program of Shandong Province under Grant No.2009GG10001002, and Independent Innovation Foundation of Shandong University under Grant No.2009TS030.

REFERENCES

[1] F Chong, G Carraro, R Wolter, “Multi-Tenant Data Architecture,” MSDN Library, Microsoft Corporation, 2006.

[2] Stefan Aulbach, Torsten Grust, Dean Jacobs, Alfons Kemper, Jan Rittinger, “Multi-Tenant Databases for Software as a Service: Schema-Mapping Techniques,” SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.

[3] Stefan Aulbach, Dean Jacobs, Alfons Kemper, Michael Seibold, “A Comparison of Flexible Schemas for Software as a Service,” SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.

[4] Mei Hui, Dawei Jiang, Guoliang Li, Yuan Zhou, “Supporting Database Applications as a Service,” IEEE 2009.

[5] C.M.Saracco, D.Chamberlin, and R.Ahuja, “DB2 9: pureXML – Overview and Fast Start,” IBM, http://ibm.com/redbooks, 2006.

[6] Rakesh Agrawal, Amit Somani, Yirong Xu, “Storage and Querying of E-Commerce Data,” In VLDB, pages 149-158, 2001.

[7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C.Hsieh, Deborah A.Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” ACM Trans.Comput.Syst, 26(2), 2008.

[8] D. Kornack and P. Rakic, “Cell Proliferation without Neurogenesis in Adult Primate Neocortex,” Science, vol. 294, Dec. 2001, pp. 2127-2130, doi: 10.1126/science.1065467.

[9] Craig D Weissman, Steve Bobrowski, “The Design of the Force.com Multitenant Internet Application Development Platform,” SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.

[10] Stratos Idreos, Martin L. Kersten, Stefan Manegold, “Self-organizing Tuple Reconstruction in Column-stores,” SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.
