
Automatic Physical Design of Databases Systems:

An Optimization Theory Approach

(Thesis Proposal)

Debabrata Dash

[email protected]

Thesis Committee

Anastasia Ailamaki (Chair)

Christos Faloutsos

Carlos Guestrin

Guy Lohman (IBM Almaden Research Lab)


1 Introduction

Database systems are too large, too complex, and too dynamic for a non-expert user

to understand all the features available for getting the most out of the system. One

of the most complex tasks for the database administrator is physically designing the

database to make it perform optimally for a query workload. The complexity of

physical design emanates from the availability of a large number of options, such as

indexes, partitions, materialized views, different workload modeling techniques, and

different data layouts. The space of possible physical design features is huge, and to

make matters worse, the interaction between these features is difficult to model even

for sophisticated users.

This thesis proposes algorithms to automate database physical design. We use

optimization techniques to provide near-optimal solutions to the physical design prob-

lem. We show that our solutions are better than existing solutions and run orders of magnitude faster than current tools for large workloads.

This thesis states that database physical design problems can be modeled as convex

optimization problems and solved efficiently and scalably, without using heuristics.

We demonstrate the validity of the thesis statement by testing our proposed algo-

rithms on a variety of workload models (offline workload, online workload), a variety

of data layouts (row-oriented, column-oriented), and for structured as well as unstructured data. Finally, as a case study, we use our algorithms to aid astronomers in real

challenges they face in everyday life, such as managing large unstructured simulation

data, tracking objects over time, and querying complex spatio-temporal relationships between objects. Using such a large spectrum of databases, we show that our algorithms

are generic enough to be applicable to new types of databases as well.

2 Background and Related Work

In this section we discuss the existing research on physical design selection. First, we

discuss the need for physical design and then discuss different ways to address the

problem.

2.1 Database Design Process

The database design process usually comprises the following steps [19]:


1. Requirement Analysis to determine the data, the relationships between the data

elements, and the queries required by the software on top of the data.

2. Logical Design to determine the conceptual model of the database from the user

requirements and the database tables. The tables and their relationships are

usually shown in an entity-relationship (ER) or a Unified Modeling Language

(UML) diagram.

3. Physical Design to ensure that the queries on the database are executed efficiently. The database designer needs to build auxiliary structures, such as indexes, materialized views, and partitions, to ensure this efficiency. In the rest of

the thesis, we denote these structures as design features. Unlike the logical de-

sign step, this step involves both the data and the queries determined in the

requirement analysis phase.

4. Monitoring and Refinement is an iterative step, where the database is monitored

for changes in user requirements or data characteristics. Once a change is detected,

the logical and physical design steps may be repeated. Typically the physical

design is repeated more often than logical design, as the workload tends to

change faster than the data characteristics.

The logical design phase mainly involves gathering the data dependencies. Once

the dependencies are determined, deriving the right schema is straightforward

using one of the available automated tools [28].

The physical design process, however, is the most challenging one, since (1) the

number of possible indexes, materialized views, and partitions can be exponential

with respect to the size of the tables; (2) their effects are very often interrelated.

This has led to many automated physical design tools, such as IBM’s DB2

Design Advisor [30], Microsoft’s Database Tuning Advisor [1], and Oracle’s SQL

Access Advisor [29].

Once the design features are built, the query optimizer in the database optimizes

queries to run them efficiently using those features. Since the query optimizer is the

consumer of the design features, the physical design process depends on modeling the optimizer.

2.2 Physical Design using Optimizer Cost Models

Early research on physical design focuses on modeling the query optimizer mathematically and then suggesting design features. Since early optimizers use


simple cost models [26], it is easy to model the entire optimization process and select

appropriate design features accurately. Lum et al. model the selection of secondary

indexes as an optimization problem [20]. Eisner et al. improve on that by mapping

the index selection problem to an equivalent network flow problem [13]. Researchers

also propose various integer linear program formulations for vertical partitioning of

tables [5, 10, 16, 22]. They all assume simple cost models for using vertical partitions

and build their optimization problems on those models. Because of the rapid improvement of optimizers over the last few decades, the cost models used in these problem formulations are now obsolete. We improve upon these formulations by modeling the

optimizer as a black-box and reusing the past optimization results. This allows us to

model the optimizer just enough to speed up the cost estimation, while avoiding the

complexity of fully modeling it [23].

2.3 Modern Physical Design using “what-if” Features

Over time, commercial-grade optimizers have become so complex that they cannot be

modeled easily. Therefore, to find the effects of different design features on a query,

it is necessary to invoke the optimizer itself. Instead of building real design features,

all modern physical design selection techniques depend on creating “what-if” features

and observing the optimizer behavior using those features. The “what-if” features

are virtual in nature: they only simulate real features, i.e., they contain statistics similar to those of the real features without building the features’ underlying data.

Microsoft’s Index Tuning Wizard and Database Tuning Advisor use a ‘candidate

selection’ phase to identify promising indexes [1, 8], a ‘merging’ step for augmenting

the candidate set, and a greedy search for selecting a locally optimal solution [9]. The

DB2 Advisor uses the optimizer to select an initial set of indexes and formulates a

knapsack problem that is solved by a greedy search [14].

Our ILP formulation is an extension of the facility-location problem [12], enabling it to search the features exhaustively instead of greedily. Caprara et al. first use this formulation to find a single index per query [7]. We extend their formulation to account for queries using more than one index and also to model index update costs. Our work is concerned with applying the formalism to real-world problems that involve commercial database systems and workloads.

In terms of modeling, we integrate the facility-location problem model with the

query optimizer in existing systems. We also deal with the problems of selecting

candidates and configurations using sensitivity analysis, so that we derive reasonably-


sized ILP instances that can be solved efficiently. In terms of solutions, we may

not achieve the optimal solutions, as the model, due to candidate and configuration

selection, already incorporates approximations. Using our technique, however, we

can determine the distance from the optimal solutions. Finally, to accurately model real-world constraints, we include storage limits in our formulation. Caprara et al. do

not consider storage and it is unclear if their analysis and proposed algorithms can

be applied to problem instances resulting from our formulation.

Heeren et al. describe an ILP and a solution based on randomized-rounding,

assuming a single index per query [15]. Their solution has optimal performance but

requires a bounded amount of additional storage.

Besides index selection, there is work on designing additional features, such as ma-

terialized views and table partitions [2,4,30,31]. The proposed algorithms are similar

to their index selection counterparts, but they face a combinatorial explosion

in the number of alternative designs that arises when exploring feature combinations.

Our modeling approach will be beneficial for problems combining multiple features,

because it can better capture the ‘interaction’ between features and it offers higher

scalability. Rao et al. extend the physical design selection to parallel databases using

a rank-based index selection [25]. Our model is more powerful than rank-based index selection and hence is applicable to parallel databases as well.

Agrawal et al. model the workload as a sequence [3] and Bruno et al. improve

on it by modeling the workload as an online query sequence [6]. They change the

workload model, but still use greedy index selection. Furthermore, their space of

possible query plans is limited, as they allow only local changes to the plan nodes.

Our approach considers complete changes in the plan, and selects better indexes than

the greedy approach. Recently, Tata et al. propose a technique for selecting physical designs for all major databases using a common client-based tool [27]. Our server-based approach is complementary to their client-based approach. Moreover, since our INUM

model is portable across multiple databases, it can simplify implementation of such

client-based tools.

Below we summarize the limitations of current approaches.

• Sub-optimal solutions: Empirically we verify that the solutions provided by

the design tools are far from optimal. For a TPC-H-like workload, a commercial system provided only 50% of the optimal improvement in query time. This shows that greedy selection of candidate solutions is not effective in finding optimal solutions.


• No quality guarantee: The existing tools do not allow the users to trade

speed for quality of the final solution. Although they allow the user to control

how long the design tool can run in order to come up with the solution, they

do not provide any indication of how the running time affects the final solution.

• Difficulty in resource planning: The existing tools assume that the user

knows exactly how many resources the new physical designs are allowed to consume. When the user wants to plan possible resource acquisitions, however,

it is desirable to present the user with a whole range of query performance for

different resource requirements. For example, if the user knows that allowing 10% more disk space for new indexes increases performance by 20%, she can plan for a bigger disk. This is not easily achieved in the current tools.

• No solution for alternative data layouts: Column-oriented databases

are becoming more popular for DSS-type workloads consisting of analytical

queries, and in-memory databases are projected to take over the OLTP work-

loads consisting of transactional queries. There are no tools in the market,

however, to suggest physical design for these databases. The physical design for

in-memory databases is critical because memory is much more expensive than

disks, requiring the selection of optimal design features.

• No solution for scientific databases: The scientific workloads use different

query types compared to transactional or analytical workloads. For example, they use recursive and time-windowed queries in addition to

standard queries. The mainstream physical design tools fail to find indexes or

materialized views for these query types.

3 Thesis Outline

This thesis proposes to address the problems discussed above by formulating the

various physical design problems as standard optimization problems and then solving

them systematically using existing tools. The thesis attempts to solve the problems

in the following steps:

1. Accurate and fast cost estimation: (Completed) We design a fast and

accurate cost estimator for queries with different physical design configurations.

We speed up the cost estimation by caching the previous optimizer outputs and


reusing them for future configurations. Reusing the optimizer outputs ensures

that our cost model keeps matching the model of the optimizer. We extend our

earlier work [23] by modeling the cost for materialized views and partitions.

2. Convex optimization formulation: (Completed) Using our caching frame-

work we model the design selection as a convex optimization problem. We then

solve it using industrial-strength optimization solvers.

3. Online design selection: (Ongoing) So far we consider the cases when the

workload is known in advance. When the workload is not known, we

formulate the design selection problem as an online optimization problem. In

particular, we are studying the application of online optimization to physical

design of caches.

4. Physical design for alternative data stores: Physical design is an impor-

tant part of the alternative database designs such as column stores. Since the

existing column store databases depend heavily on selecting the right set of

columns to group together, tuning the design with the workload is extremely

important. We plan to model the design problem including the grouping, order-

ing and compression requirements of the column stores to achieve the optimal

design.

5. Physical design for unstructured databases: In unstructured databases

we lack statistics about the data on which queries operate.

We target the problem of optimizing the physical design for unstructured data

queried using tools like map-reduce [11], Dryad [18] and Hadoop [17]. The phys-

ical design algorithm needs to learn the data and the workload characteristics

to optimize the performance of the workload.

6. Case study – simulation and observational databases for astronomy:

(Ongoing) We intend to apply our design selection methods to optimizing scientific workloads. Scientists have different types of workloads, such as those involving moving objects or massive recursive queries. We intend to apply and, if needed, extend our methods to select the optimal design features for scientific database workloads.

Step 1 enables us to build large-scale optimization problems by speeding up the

cost estimation. Steps 2 and 3 address problems like sub-optimality, lack of

quality guarantee, and difficulty in resource planning. By formulating the physical

design problem as an optimization problem, we obtain optimal solutions. We allow


the user to speed up the optimization process by reducing the design search space

and provide guarantees on the loss of quality because of the reduced search space.

Since the space constraint is just one of the thousands of constraints in our problem

formulations, and modern solvers can easily re-optimize for small changes in the

problem, we can provide the entire spectrum of design solutions by altering the space

constraints. Steps 4 and 5 select the physical designs for alternative data layouts and

unstructured databases. Finally, Step 6 addresses the lack of physical design tools for

the scientific database community.

4 Efficient Use of Optimizer for Index Selection

Problem

State-of-the-art database design tools rely on the query optimizer for comparing physical design alternatives. Although it provides an appropriate cost model

for physical design, query optimization is a computationally expensive process. The

significant time consumed by optimizer invocations poses serious performance limita-

tions for physical design tools, causing long running times, especially for large problem

instances. So far it has been impossible to remove query optimization overhead with-

out sacrificing cost estimation precision. Inaccuracies in query cost estimation are

detrimental to the quality of physical design algorithms, as they increase the chances

of “missing” good designs and consequently selecting sub-optimal ones. Precision loss

and the resulting reduction in solution quality are particularly undesirable, since precision is the

reason the query optimizer is used in the first place.

In our approach, instead of replacing the optimizer completely, we cache the pre-

viously returned plans from the optimizer and reuse them to provide accurate and

fast cost estimation. The intuition behind the INUM is that although design tools

must examine a large number of alternative designs, the number of different opti-

mal query execution plans and thus the range of different optimizer outputs is much

smaller. Therefore it makes sense to reuse the optimizer output, instead of repeatedly

computing the same plan.

The INUM works by first performing a small number of key optimizer calls per

query in a precomputation phase and caching the optimizer output (query plans along

with statistics and costs for the individual operators). During normal operation,

query costs are derived exclusively from the precomputed information without any

further optimizer invocation. The derivation involves a simple calculation (similar to


computing the value of an analytical model) and thus is significantly faster compared

to the complex query optimization code.

We present the details of the INUM model in our earlier work [23], and list only its important implications below.

1. If a query involves only Merge-Join and Hash-Join plans, the cost of joining

and aggregation does not depend on the cost of accessing data from the table

or indexes. Hence, the total cost of the query depends linearly on the access

costs for each table.

2. If a query involves only Merge-Join and Hash-Join plans, then caching one plan

per interesting-order combination is sufficient to answer all possible incoming

index combinations. We define an interesting-order combination as a tuple con-

taining at most one interesting-order for each table referenced in the query.

An interesting order is a column that improves query performance when the data is ordered on it [26].

3. For queries involving all join methods including Nested-Loop Joins, we cache

more than one plan per interesting-order combination and approximate the optimal plan cost to within 5%.
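To make implication 1 concrete, the following is a minimal sketch of INUM-style cost derivation. The data structures, names, and numbers here are hypothetical illustrations, not the actual implementation, which is described in [23].

```python
# Sketch: derive a query's cost for a candidate configuration from cached
# plans, without invoking the optimizer. Per implication 1, each cached
# plan's join/aggregation ("internal") cost is independent of how each
# table is accessed, so the total cost is linear in per-table access costs.

def inum_cost(cached_plans, access_costs):
    """cached_plans: list of {"internal_cost": float, "tables": [names]}.
    access_costs: table name -> cheapest access cost (index or heap scan)
    under the configuration being evaluated."""
    best = float("inf")
    for plan in cached_plans:
        total = plan["internal_cost"] + sum(
            access_costs[t] for t in plan["tables"])
        best = min(best, total)  # keep the cheapest cached plan
    return best

# Made-up example: two cached plans over the same two tables.
plans = [{"internal_cost": 100.0, "tables": ["orders", "lineitem"]},
         {"internal_cost": 80.0, "tables": ["orders", "lineitem"]}]
costs = {"orders": 10.0, "lineitem": 35.0}
print(inum_cost(plans, costs))  # 125.0
```

This calculation replaces a full optimizer call with a handful of additions and comparisons, which is why INUM-based cost estimation is orders of magnitude faster.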

Experimental Results for INUM We implement INUM using Java (JDK1.4.0)

and interface our code to the optimizer of a commercial DBMS, which we will call

System1. Our implementation demonstrates the feasibility of our approach in the

context of a real commercial optimizer and workloads and allows us to compare di-

rectly with existing index selection tools. To evaluate the benefits of INUM, we build

on top of it a very simple index selection tool, called eINUM. eINUM is essentially

an enumerator, taking as input a set of candidate indexes and performing a simple

greedy search, similar to the one used in [8].

We choose not to implement any candidate pruning heuristics because one of our

goals is to demonstrate that the high scalability offered by INUM can deal with large

candidate sets that have not been pruned in any way. We “feed” eINUM with an

exhaustive candidate set generated by building an index on every possible subset

of attributes referenced in the workload. From each subset, we generate multiple

indexes, each having a different attribute as prefix. This algorithm generates a set of

indexes on all possible attribute subsets, and with every possible attribute as key.
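As an illustration, the exhaustive candidate generation just described can be sketched as follows; the function and variable names are ours, not eINUM's.

```python
# Sketch: for every non-empty subset of the referenced attributes,
# emit one index per choice of leading (prefix) attribute.
from itertools import combinations

def generate_candidates(attributes):
    candidates = set()
    for r in range(1, len(attributes) + 1):
        for subset in combinations(attributes, r):
            for key in subset:  # each attribute tried as the index prefix
                rest = tuple(a for a in subset if a != key)
                candidates.add((key,) + rest)
    return candidates

cands = generate_candidates(["a", "b", "c"])
print(len(cands))  # 12 candidates for three attributes
```

Such a set grows exponentially in the number of attributes, which is exactly the unpruned search space that INUM's fast cost estimation makes feasible to explore.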

We experiment with the 1GB version of the TPC-H benchmark with a workload

consisting of 15 out of the 22 queries, which we call TPCH15. We were forced to


omit certain queries due to limitations in our parser but our sample preserves the

complexity of the full workload.

We use a dual-Xeon 3.0GHz based server with 4 gigabytes of RAM running Win-

dows Server 2003 (64bit).

On average eINUM takes only 0.34 milliseconds to find the cost of a configuration,

while the commercial system takes about 1.3 seconds. This significant speedup in cost estimation allows eINUM to suggest better indexes as well.

The construction of the INUM takes 1243s, or about 21 minutes, spent in per-

forming 1358 “real” optimizer calls. The number of actual optimizer calls is very

small compared to the millions of INUM cost evaluations performed during tuning.

We can also “compress” the time spent in INUM construction for smaller problems.

System1 required 246 seconds of total tuning time, and optimization time accounted

for 92% of the total tool running time.

5 An ILP Model for Index Selection

In this section, we introduce an integer linear programming formulation that captures

the full complexity of the index selection problem.

5.1 Mathematical Formulation

Consider a workload consisting of $m$ queries and a set of $n$ indexes $I_1, \ldots, I_n$ with sizes $s_1, \ldots, s_n$. We want our model to account for the fact that a query has different costs depending on the combination of indexes it uses. A configuration is a subset $C_k = \{I_{k_1}, I_{k_2}, \ldots\}$ of indexes with the property that all of the indexes in $C_k$ are used by some query.

Let $P$ be the set of all the configurations that can be constructed using the indexes in $I$ and that can potentially be useful for a query. For example, if a query accesses tables $T_1$, $T_2$, and $T_3$, then $P$ contains all the elements in the set (indexes in $I$ on $T_1$) $\times$ (indexes in $I$ on $T_2$) $\times$ (indexes in $I$ on $T_3$).

The cost of a query $i$ when accessing a configuration $C_k$ is $c(i, C_k)$, and $c(i, \{\})$ denotes the cost of the query on an unindexed database. We define the benefit of a configuration $C_k$ for query $i$ by $b_{ik} = \max(0, c(i, \{\}) - c(i, C_k))$.

Let $y_j$ be a binary decision variable that is 1 if index $I_j$ is actually implemented and 0 otherwise. In addition, let $x_{ik}$ be a binary decision variable that is equal to 1


if query $i$ uses configuration $C_k$ and 0 otherwise.

Using $x_{ik}$ and $b_{ik}$, the benefit $Z$ for the workload is

$$Z = \sum_{i=1}^{m} \sum_{k=1}^{p} b_{ik} \, x_{ik} \qquad (1)$$

where $p = |P|$. The values of $x_{ik}$ depend on the values of $y_j$: we cannot have a query using $C_k$ if a member of $C_k$ is not implemented.

Also, we require that a query uses at most one configuration at a time. For

instance, a query cannot simultaneously use both $C_1 = \{I_1, I_2, I_3\}$ and $C_2 = \{I_1, I_2\}$. Finally, we require that the set of selected indexes consumes no more than

$S$ units of storage. Thus the formal specification of the index selection problem is as

follows.

$$\text{maximize} \quad Z = \sum_{i=1}^{m} \sum_{k=1}^{p} b_{ik} \, x_{ik} \qquad (2)$$

subject to

$$\sum_{k=1}^{p} x_{ik} \le 1 \quad \forall i \qquad (3)$$

$$x_{ik} \le y_j \quad \forall i,\ \forall j, k : I_j \in C_k \qquad (4)$$

$$\sum_{j=1}^{n} s_j \, y_j \le S \qquad (5)$$

Equation 3 guarantees that a query uses at most one configuration. Equation 4

ensures that we cannot use a configuration $C_k$ unless all the indexes in it are built, and Equation 5 expresses the constraint for the available storage $S$. Observe that the space

restriction is just one constraint in the generated ILP problem. Hence, unlike earlier

heuristic approaches, we can consider different storage constraints by tweaking the

problem slightly.

The above ILP formulation can be easily extended to account for update costs.

With every index $I_j$ we associate a (negative) benefit value $-f_j$, which denotes the cost of updating the index $I_j$ over all the update statements in the workload. Hence, we

modify the objective function in Equation 2 to take into account the negative benefit

values $f_j$:


$$\text{maximize} \quad Z = \sum_{i=1}^{m+m_1} \sum_{k=1}^{p} b_{ik} \, x_{ik} \;-\; \sum_{j=1}^{n} f_j \, y_j \qquad (6)$$

Equation 6 describes the workload benefit in the presence of $m$ queries and $m_1$ update statements. The second term simply states that if index $I_j$ is constructed as part of the solution, it will cost $f_j$ units of benefit to maintain it in the presence of the $m_1$ update statements.
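As a toy illustration of Equations (2)–(6), the following sketch solves a tiny made-up instance by brute-force enumeration over index subsets. All index names, benefit values, and sizes are invented; a real tool would hand the same model, at scale, to an industrial ILP solver rather than enumerate.

```python
# Toy instance of the index selection ILP (Equations 2-6), solved by
# exhaustive enumeration of the y_j assignments. All data is made up.
from itertools import product

sizes   = {"I1": 2, "I2": 3, "I3": 4}          # s_j, in storage units
update  = {"I1": 1, "I2": 0, "I3": 2}          # f_j, update penalties
configs = {"C1": {"I1"}, "C2": {"I1", "I2"}, "C3": {"I3"}}
benefit = {("q1", "C1"): 5, ("q1", "C2"): 9,   # b_ik (zero entries omitted)
           ("q2", "C3"): 7}

def solve(S):
    best = (0, frozenset())
    for bits in product([0, 1], repeat=len(sizes)):
        built = {j for j, y in zip(sorted(sizes), bits) if y}
        if sum(sizes[j] for j in built) > S:      # Equation (5)
            continue
        z = 0
        for q in {i for i, _ in benefit}:
            # Equations (3)-(4): each query picks at most one
            # configuration, and only ones whose indexes are all built.
            usable = [benefit[i, k] for (i, k) in benefit
                      if i == q and configs[k] <= built]
            z += max(usable, default=0)
        z -= sum(update[j] for j in built)        # update term of (6)
        best = max(best, (z, frozenset(built)), key=lambda t: t[0])
    return best

print(solve(5))  # best objective value and the chosen index set
```

Because the storage budget $S$ enters only through Equation (5), re-solving for a sweep of budgets yields the benefit-versus-storage spectrum discussed in Section 3.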

Supporting clustered indexes is straightforward with our model. A candidate

clustered index is yet another index in the candidate set, one that contains all the

attributes in a relation. We allocate a $y_j$ variable to it as usual. It also participates

in combinations naturally. The size of clustered indexes is artificially set to 0 (as no

additional space is required to sort a table). For each table T we restrict the set of

clustered indexes on it, say {yT c1, yT c

2, ...yT c

l} so that only one clustered index is picked:

yT c1

+ yT c2

+ ... + yT cl≤ 1 (7)

The exact solution provided by the ILP formulation is optimal for the given initial

selection of indexes. If we were to include all the possible indexes that are relevant

to the given workload, it would give us the globally optimal solution.

Considering the set of all the possible indexes is prohibitively expensive and thus

a candidate selection module is necessary. The ILP approach is flexible enough that

we can use it with an arbitrary candidate index set; in fact, the ILP model enables us

to derive tight bounds on the quality loss when we remove some candidates.

5.2 An ILP-based Index Selection Tool

In the previous section, we present an ILP formulation that completely describes the

index selection problem. In this section, we discuss the architecture of a practical

ILP-based index selection tool.

Fig. 1 details the components that are used in our tool. All the modules except

for the ILP solver are used in the construction of the model: deciding the $x_{ik}$ and $y_j$ variables and computing the benefit values $b_{ik}$, all described in Section 5. Once the

model is constructed, the ILP solver is used to determine the optimal solution.

The Candidate Selection and Combination Selection modules determine the $x_{ik}$ and $y_j$ decision variables for the problem at hand. For each combination $C_k$ participating in the model, the Cost Estimation module determines the query


[Diagram: the Workload and Constraints enter the tool; the Candidate Selection, Combination Selection, and Cost Estimation (INUM coupled to the Optimizer) modules build the ILP Model, which is passed to the ILP Solver to produce Solutions and Bounds.]

Figure 1: Architecture for an ILP-based index selection algorithm.

costs and the corresponding $b_{ik}$ benefit values. Cost Estimation is typically based on the

query optimizer. In our system, we couple the query optimizer to the Index Usage

Model (INUM), a mechanism we have developed for improving the performance of

query cost estimation through caching and reusing of optimizer computation. The

INUM is ≈3,000 times faster than a query optimizer call, while providing exactly the

same result. Section 4 provides an overview of the INUM, while a detailed description

appears in [23]. The completed ILP representation is consequently input to the ILP

solver.

The overall performance depends on the numbers of $y_j$ and $x_{ik}$ variables, which

control the problem size and consequently the optimization time and the time spent

in the cost estimation module. Our ILP formulation is advantageous compared to

existing approaches in that the efficiency of modern ILP solvers and of the INUM

allows us to consider very large numbers of decision variables (candidate indexes and

index combinations). In our experiments, we have been able to solve ILP instances

considering up to 110,000 candidate indexes and 3.2 million combinations within

minutes.

Furthermore, our ILP formulation allows for a particularly attractive modular

design, where the impact of optimizations in each module on the final solution can be

analyzed and quantified. Each module can apply cost-based pruning (for Candidate

and Combination Selection) and approximations (for Cost Estimation), to improve

performance.


5.3 Experimental Results on TPC-H

We experiment with a workload consisting of 1000 queries, generated by changing the

parameters for the TPC-H queries, using the QGEN utility. We compare our ILP-

based approach against the commercial index selection tool integrated in our server,

for solution quality, for various amounts of storage space available for index selection.

Table 1 shows the number of candidates and combinations considered by our

ILP-based and the commercial index selection tool. The first column of the table

corresponds to the size of the resulting ILP instance. The data for the second column

is obtained through the profiling of the commercial tool. Due to the efficiency of our

ILP solver and the INUM, our ILP tool was able to handle an order of magnitude

more indexes and combinations.

              ILP     Commercial
Combinations  348147  42189
Candidates    1931    138

Table 1: Number of candidates and combinations considered by our ILP-based and

the commercial index selection tool.

[Figure: bar chart of Workload Speedup (%) (y-axis, 40–80%) versus Index Size (1GB, 3GB), with series Commercial, ilp, and ilp_cost_approx.]

Figure 2: Comparing the solution quality between our ILP-based and the commercial index selection tool for two storage constraint values.

Figure 2 compares the solution quality achieved by our ILP-based and the commercial tool for 1GB and 3GB storage constraint values. For the "tight" storage constraint of 1GB, the ILP algorithm (ilp) provides 16 more percentage points of benefit to the workload compared to the commercial tool (commercial). Equivalently, when using the indexes recommended by our ILP tool, the workload runs 30% faster.

Figure 2 also shows the solution quality achieved using cost approximation in addition to table-subset pruning (ilp_cost_approx). In this case we apply cost approximation using a single plan per query in the INUM space. The quality loss resulting from cost approximation is less than 3 percentage points in both cases.

6 Integer-Convex Programming Model

Although the above approach allows us to accurately model the index selection problem, the number of configurations explodes combinatorially, and only pruning some of the configurations makes the formulation practical. In this section, we propose a different formulation of the problem, which makes the optimization program construction even more practical. We reduce the program size by exploiting the model created by INUM and the solver's ability to compute optimal combinations efficiently.

We consider a workload consisting of m queries and a set of n indexes I_1–I_n, with sizes s_1–s_n. Instead of maximizing the benefit of using a configuration, we try to minimize the cost of using it. Let O_ij be the jth interesting-order combination for query i. Since we cache multiple plans for each interesting order, let P_ijk be the kth plan cached for the interesting-order combination O_ij. According to INUM's cost model, the cost for the configurations matching an interesting-order combination O_ij can be represented as:

Cost(O_ij) = min_k ( w_ijk + Σ_{t ∈ T_i} w_ijkt · a_ti )

where T_i is the set of tables accessed in query i, a_ti is the cost of accessing table t in query i, and w_ijk and w_ijkt are the constants of the linear cost equation in INUM. Using the above equation, we can represent the cost C_i of query i as

C_i = min_j Cost(O_ij) = min_{j,k} ( w_ijk + Σ_{t ∈ T_i} w_ijkt · a_ti )

Therefore, the total cost of the workload C becomes

C = Σ_i min_{j,k} ( w_ijk + Σ_{t ∈ T_i} w_ijkt · a_ti )
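As a sanity check of this cost model, the following sketch evaluates the per-query minimum over cached plans directly in Python; all plan constants w_ijk, w_ijkt and table access costs a_ti below are invented for illustration.

```python
# Toy evaluation of the INUM cost model: for each query, the cost is the
# minimum over cached plans of w_const + sum_t w_t * a_t.

def query_cost(plans, access_cost):
    """plans: list of (w_const, {table: w_t}) cached for one query;
    access_cost: {table: a_t} under the index configuration being evaluated."""
    return min(
        w_const + sum(w_t * access_cost[t] for t, w_t in w_tables.items())
        for w_const, w_tables in plans
    )

plans_q1 = [
    (10.0, {"lineitem": 2.0, "orders": 1.0}),  # plan for one interesting order
    (4.0,  {"lineitem": 3.0, "orders": 2.0}),  # plan for another order
]
access = {"lineitem": 5.0, "orders": 3.0}      # a_ti for some configuration

print(query_cost(plans_q1, access))  # → 23.0
```

Summing `query_cost` over all queries gives the workload cost C that the program below minimizes.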

In this form, the above formulation is not a convex program, as the min function is not a convex function. To convert it to a convex program, we introduce a binary variable o_ir, which represents the cached plan selected for query i. The variable o_ir is essentially an indicator constraint for the selection of an interesting order r of query i.


We also reuse the variable y_j to represent the selection of an index I_j. The objective of the program becomes:

minimize Σ_i C_i

Such that:

o_ir = 1 ⇒ C_i = C^int_ir + Σ_{t ∈ T_i} w_irt · S_ti · y_j

Commercial solvers such as CPLEX provide efficient implementations of such indicator constraints.

To maintain correctness of the formulation, we have to ensure that (1) the variable o_ir is picked only when all the required indexes providing the interesting orders for that cached plan are selected; (2) for each query, exactly one interesting order is picked; (3) only one index per table is used to determine the cost of a query plan.

We introduce another binary variable y_irj, which indicates that the index I_j is used by the rth plan of query i.

minimize Σ_i C_i                                                  (8)

Such that:                                                        (9)

o_ir = 1 ⇒ C_i = C^int_ir + Σ_{t ∈ T_i} w_irt · S_ti · y_irj      (10)

∀i: Σ_r o_ir = 1                                                  (11)

∀i, ∀t: Σ_{y_irj ∈ Uses(t)} y_irj = 1                             (12)

∀j: y_j ≤ Σ_{i,r} y_irj ≤ m · y_j                                 (13)

∀i, ∀r, ∀t ∈ T_i: Σ_{y_irj ∈ V_irt} y_irj ≥ o_ir                  (14)

Σ_j s_j · y_j ≤ S                                                 (15)

y_Tc1 + y_Tc2 + ... + y_Tcl ≤ 1                                   (16)

where V_irt represents the set of indexes on table t that provide the interesting order required by o_ir, and the set Uses(t) is the set of indexes on table t. Equation 10 ensures that if we select a plan, then we also select its associated cost. Equation 11 ensures that exactly one plan is selected for each query. Equation 12 ensures that, for a query, only one index is selected per table. Equation 13 ensures that if we select an index for a plan, then the global index is selected as well. Equation 15 is the storage constraint. The last constraint ensures that we select at most one clustered index for each table.
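A simplified feasibility check over a toy instance (one query, two interesting orders, three invented indexes) illustrates the role of these constraints. Equation 14 is simplified here so that each interesting order is tied to a fixed set of supporting indexes, and the indicator constraint on C_i is omitted.

```python
# Toy check of the ICP side constraints on an invented instance.
# o[r] picks an interesting order for the single query; y[j] selects index j.

def feasible(o, y, sizes, S, order_indexes):
    if sum(o) != 1:                       # Eq. 11: exactly one order per query
        return False
    if sum(sizes[j] * y[j] for j in range(len(y))) > S:  # Eq. 15: storage
        return False
    for r, needed in enumerate(order_indexes):  # simplified Eq. 14: a chosen
        if o[r] and not all(y[j] for j in needed):  # order needs its indexes
            return False
    return True

sizes = [200, 500, 300]          # invented index sizes
order_indexes = [[0], [1, 2]]    # indexes each interesting order relies on

print(feasible([1, 0], [1, 0, 0], sizes, 700, order_indexes))  # → True
print(feasible([0, 1], [0, 1, 0], sizes, 700, order_indexes))  # → False
```

The second call fails because order 1 is chosen but index 2, which it relies on, is not selected.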

Unlike the formulation in Equation 6, this formulation does not enumerate all configurations. Therefore, the problem size remains linear in the size of the candidate index set. To understand the variable and constraint counts, consider a simple query with just one possible plan: a join on two tables. If t1 and t2 are the numbers of candidate indexes on those two tables, then the number of possible configurations for the ILP is t1 × t2. Therefore the number of variables in the ILP is O(t1 × t2 + t1 + t2) and the number of constraints is O(t1 × t2). In the ICP, however, the number of variables and constraints is approximately O(t1 + t2). The ICP formulation scales linearly with the number of cached plans, whereas the ILP is independent of the number of cached plans.
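The counting argument above can be restated as two small helper functions; the formulas simply transcribe the O(·) estimates in the text for the two-table join example.

```python
# Problem-size comparison for a single two-table join query.

def ilp_size(t1, t2):
    """ILP enumerates index combinations: one variable per configuration
    plus one per index, and O(t1 * t2) constraints."""
    configs = t1 * t2
    return configs + t1 + t2, configs

def icp_size(t1, t2, plans=1):
    """ICP stays roughly linear in the candidate count per table,
    scaled by the number of cached plans."""
    return plans * (t1 + t2), plans * (t1 + t2)

print(ilp_size(100, 100))  # → (10200, 10000)
print(icp_size(100, 100))  # → (200, 200)
```

Even at 100 candidates per table, the ICP instance is two orders of magnitude smaller.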

ILP vs. ICP: Unlike the ICP, the ILP formulation does not assume any internal details of the cost estimation module. Given any cost estimation module fast enough to support millions of optimization queries, it is possible to build an ILP. For large problems, however, the ILP formulation needs pruning methods to keep the problem size under control. Although the pruning methods do not affect the final solution in our experiments, the upper bounds on the quality loss are relatively high, making them less desirable than no pruning at all. The ICP formulation depends on the details of the cost estimation module (INUM in our case); however, by exploiting those internal details, it generates a smaller optimization problem. Furthermore, the cost bounds determined using the ICP formulation are much tighter than the cost bounds for the ILP formulation. Although we build the ICP on top of INUM's cost model, we can adopt any other convex formulation as the basis for the ICP. Most modern optimizers use dynamic programming to determine the optimal plan for a query, which can be easily modeled as a convex optimization problem [24]. Hence a more complex optimizer model based on dynamic programming can also be expressed in the ICP.


7 Applying Physical Design Techniques to Scientific Databases

Scientists have very specific needs for their databases. For example, the simulation databases generated by scientists need recursive queries to determine the origin/termination of particles in the space-time continuum. Unsurprisingly, recursive queries are not part of database benchmarks such as TPC-H and TPC-DS. Hence the commercial vendors do not support the optimal physical design of such queries.

To design the database for a scientific workload, we use a database to manage the output of an astronomy experiment, consisting of 128 gigabytes of raw data. Since the data is generated in a number of steps, we design a loader which combines the data from different steps and loads them together into the database. The regime where database technology can make the greatest impact for simulations is tracking group evolution over time. This is because simulation outputs are typically stored as snapshots. Thus, queries that focus on a single instant in time have an advantage over ones that examine time evolution, simply because the file format naturally lends itself to this kind of inquiry. We find that scientists correspondingly limit their own research on this basis, preferring the "low-lying fruit" of problems that can be addressed using a single snapshot (or perhaps comparing two snapshots) and avoiding those that require looking at a range of snapshots. Our current query set, therefore, focuses on the time domain. We implement 6 such representative queries on the data, so that the astronomers can query the data by changing query parameters.

The most common type of query is "going back in time", i.e. identifying a set of progenitors for a given set of "particle groups". Implementing this traversal in a database is not straightforward, requiring the development of "recursive" queries, which are not optimized in existing systems. The recursive aspect arises from the nature of the data. A given cosmological group g in output t knows its group ID in outputs t−1 and t+1. To follow the evolution of a group through many outputs, one must trace these links through many sequential outputs: e.g. galaxy 123 in output 10 is the same as galaxy 186 in output 9, which is the same as galaxy 452 in output 8, etc.

Optimizing recursive query performance over large datasets is not well understood,

and the tools currently available for automated indexing and partitioning do not work

for recursive queries. We initially experiment with the obvious choice of flattening

our database structures to eliminate recursive queries, but that significantly increases

data sizes.
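The naive link-following traversal can be sketched as follows; the snapshot numbers and the 123 → 186 → 452 chain come from the example above, while the final group ID (431) and the table layout are invented for illustration.

```python
# Naive "going back in time" traversal: each snapshot maps a group ID to its
# progenitor's ID in the previous snapshot, so tracing a group across k
# snapshots follows k sequential links (k joins when expressed in SQL).

# progenitor[t][g] = ID of group g's progenitor in snapshot t-1
progenitor = {
    10: {123: 186},
    9:  {186: 452},
    8:  {452: 431},   # invented continuation of the chain
}

def trace_back(group_id, start, stop):
    """Follow progenitor links from snapshot `start` back to `stop`."""
    path = [(start, group_id)]
    for t in range(start, stop, -1):
        group_id = progenitor[t][group_id]
        path.append((t - 1, group_id))
    return path

print(trace_back(123, 10, 7))
# → [(10, 123), (9, 186), (8, 452), (7, 431)]
```

Tracing across d snapshots touches d link tables, which is exactly the cost the hybrid materialization below reduces.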


We implement a hybrid approach between the existing techniques and completely flattening the time steps. We do not materialize the complete progenitor relationships, but materialize them in steps. In the sample database, the progenitors are recorded in 256 time steps. Instead of materializing 255 different progenitor tables (one for each time-step difference), we materialize only 8 progenitor tables, each containing the progenitor information 2^i time steps away. In our case we keep 1, 2, 4, 8, 16, 32, 64, and 128-hop progenitor information. This allows us to obtain accurate progenitor information using only 7 joins instead of 255 joins. This method is possible because of the following two properties of the data:

1. For most of the queries, we know the starting and the ending time steps.

2. Inspecting the progenitor tree, we observe that on average the fan-out at each node is considerably smaller than 2. This keeps the tables at 2^i hops similar in size to the original table. The average degree of the tree is 1.018 in the sample data.
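The 7-join bound follows from binary decomposition: any hop distance d ≤ 255 is a sum of at most 8 powers of two, so chaining the matching 2^i-hop tables needs at most 7 joins. A minimal sketch:

```python
# Decompose a hop distance into the power-of-two hop tables that cover it.

def hop_tables_needed(d):
    """Return the power-of-two hop distances whose sum is d (set bits of d)."""
    hops, i = [], 0
    while d:
        if d & 1:
            hops.append(1 << i)
        d >>= 1
        i += 1
    return hops

print(hop_tables_needed(255))  # → [1, 2, 4, 8, 16, 32, 64, 128], i.e. 7 joins
print(hop_tables_needed(100))  # → [4, 32, 64], i.e. 2 joins
```

Chaining k hop tables costs k − 1 joins, so the worst case (d = 255, all 8 tables) needs 7 joins.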

[Figure: (a) bar chart of Number of Rows (log scale, 1 to 10^10) for the Original, Hybrid, and Full materialization schemes; (b) bar chart of query time in seconds (log scale, 0.1 to 100) for the same schemes.]

Figure 3: Performance comparison of hybrid methods against naive methods

Figures 3(a) and 3(b) show that the hybrid scheme occupies only 0.02% of the space of the full materialization, while being only 14 times slower than the full materialization. The hybrid scheme is about 12 times faster than the naive implementation. This benefit increases when we run on more than 256 time steps in the simulation.1

7.1 Physical Design of Distributed Database Caches

Distributed databases usually maintain proxy caches to serve a large number of queries. Unlike standard web-based workloads, the queries sent to the proxy caches are

1This is joint work with Andy Connelly and Jeff Gardner at the University of Washington.


highly dynamic and evolve over time. Therefore, selecting a representative workload to tune using the existing physical design tools is difficult. Caches present a unique design point for the physical design tools, as the designs can be optimized both at the client site and at the source site. Often they can be optimized together to provide the best performance and low network bandwidth. In this work, we design both competitive and incremental algorithms for the physical design of such distributed caches [21].2

8 Things to be done

8.1 Online Optimization

In our current work, we assume full knowledge of the queries in the system. This is true for standard workloads such as web applications or the TPC-H benchmark. In many cases, however, the user can form ad-hoc queries, and then we do not have prior information about the workload to optimize for. We need to suggest the future physical design using information about the past workload. Since the workload can change in the future and implementing the design features takes time and space, we need to suggest design changes which appropriately balance query performance against the cost of the changes.

8.2 Alternative Data Layouts

Column-oriented databases provide better access to the data when the data is mostly read-only. They also provide good support for ad-hoc queries. To allow our convex optimization techniques to be used in this framework, we will model the query costs in column-oriented databases. Since column-oriented database optimizers are not as complex as those for row databases, we can take advantage of better and simpler models for this problem. Similarly, for in-memory OLTP databases, the focus will be on modeling data changes. So far we have not considered change a significant factor in physical design; that assumption is completely invalid in the case of OLTP databases. This work should provide interesting applications of our convex optimization techniques in these new problem spaces.

2This is joint work with Xiaodan Wang of Johns Hopkins University and Tanu Malik of Purdue University.


8.3 Physical Design for Unstructured Databases

In the case of unstructured databases, we know neither the queries nor the structure of the data. In the techniques currently used to process unstructured data, such as MapReduce [11] and Hadoop [17], the user parses the data afresh and processes it each time. This makes the queries easy to implement, but does not take advantage of many database innovations, such as using statistics to model the query cost or physical design features for speed. We intend to study the application of convex optimization methods to the map-reduce framework to find the optimal materialization points and to reuse the work done by previous map-reduce jobs to speed up future tasks.

8.4 Case Study: Scientific Database Optimization

Scientific data continue to yield interesting optimization problems. For example, one of the recent optimization goals of our collaborators is to design spatial and temporal indexes that use the least space possible. The space requirement is of paramount interest to the scientific database community, and our optimization techniques are ideal for such constrained environments, where heuristics lead to suboptimal behavior. We intend to showcase our techniques using scientific data management problems as case studies.


9 Time-line

Task  Task Description                             Start Date  End Date
1     ILP/ICP formulations                         1-Apr-08    1-Nov-08
1.1   Scaling the ILP using sampling               1-Apr-08    27-Jun-08
1.2   Integrating materialized views into INUM     1-Feb-07    27-Jun-08
1.3   Integrating all features into INUM           1-Jul-08    1-Oct-08
1.4   Scaling ICP using solver techniques          1-Jul-08    1-Nov-08
2     Online physical design                       1-Apr-08    27-Jun-08
2.1   ILP for vertical partitioning                1-May-08    27-Jun-08
2.2   INUM for vertical partitioning               1-May-08    27-Jun-08
2.3   Online algorithm                             1-May-08    27-Jun-08
3     Physical design for column stores            1-Dec-08    1-Apr-09
3.1   Cost model for column stores                 1-Dec-08    1-Jan-09
3.2   ILP for column stores                        1-Jan-09    1-Apr-09
3.3   Improving column store for scientific DB     1-Dec-08    1-Apr-09
4     Physical design for Hadoop                   1-Apr-09    1-Nov-09
4.1   Identifying the physical design features     1-Apr-09    1-Nov-09
4.2   Learning data characteristics                1-Apr-09    1-Nov-09
4.3   Learning workload characteristics            1-Apr-09    1-Nov-09
4.4   Cost model                                   1-Apr-09    1-Nov-09
5     Physical design for Astronomy database       1-Apr-07    1-Jan-10
5.1   Collecting and loading data                  1-Apr-07    1-Mar-08
5.2   Optimizing recursive queries                 1-Oct-07    1-Jan-08
5.3   Optimizing spatial moving object queries     1-Apr-09    1-Oct-08
5.4   Implementing physical design features        1-Oct-09    1-Jan-10


References

[1] Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollar, Arunprasad P. Marathe, Vivek R. Narasayya, and Manoj Syamala. Database Tuning Advisor for Microsoft SQL Server 2005. In VLDB, pages 1110–1121, 2004.

[2] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated selec-

tion of materialized views and indexes in SQL databases. In VLDB 2000.

[3] Sanjay Agrawal, Eric Chu, and Vivek Narasayya. Automatic physical design

tuning: workload as a sequence. In SIGMOD ’06: Proceedings of the 2006 ACM

SIGMOD international conference on Management of data, pages 683–694, New

York, NY, USA, 2006. ACM.

[4] Sanjay Agrawal, Vivek Narasayya, and Beverly Yang. Integrating vertical and

horizontal partitioning into automated physical database design. In Proceedings

of the SIGMOD Conference, New York, NY, USA, 2004. ACM Press.

[5] Jair M. Babad. A record and file partitioning model. Commun. ACM, 20(1):22–

31, 1977.

[6] Nicolas Bruno and Surajit Chaudhuri. An online approach to physical design

tuning. In IEEE 23rd International Conference on Data Engineering (ICDE

2007), 2007.

[7] A. Caprara and J. Salazar. A branch-and-cut algorithm for a generalization of

the uncapacitated facility location problem, 1996.

[8] Surajit Chaudhuri and Vivek R. Narasayya. An efficient cost-driven index selec-

tion tool for Microsoft SQL server. In Proceedings of VLDB 1997.

[9] Surajit Chaudhuri and Vivek R. Narasayya. Index merging. In Proceedings of

ICDE 1999.

[10] D.W. Cornell and P.S. Yu. An effective approach to vertical partitioning for phys-

ical design of relational databases. IEEE Transactions on Software Engineering,

16(2):248–258, 1990.

[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[12] Zvi Drezner and Horst W. Hamacher. Facility Location: Applications and Theory.

2001.


[13] Mark J. Eisner and Dennis G. Severance. Mathematical techniques for efficient

record segmentation in large database systems. 1976.

[14] G. Valentin, M. Zuliani, D. Zilio, and G. Lohman. DB2 Advisor: an optimizer smart enough to recommend its own indexes. In Proceedings of ICDE 2000.

[15] C. Heeren, H. V. Jagadish, and L. Pitt. Optimal indexes using near-minimal

space. In PODS, 2003.

[16] Jeffrey A. Hoffer and Dennis G. Severance. The use of cluster analysis in physical

data base design. In Douglas S. Kerr, editor, Proceedings of the International

Conference on Very Large Data Bases, September 22-24, 1975, Framingham,

Massachusetts, USA, pages 69–86. ACM, 1975.

[17] http://hadoop.apache.org. Apache Hadoop project.

[18] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly.

Dryad: Distributed data-parallel programs from sequential building blocks. In

European Conference on Computer Systems (EuroSys), pages 59–72, Lisbon, Por-

tugal, March 21-23 2007. also as MSR-TR-2006-140.

[19] Sam S. Lightstone, Toby J. Teorey, and Tom Nadeau. Physical Database De-

sign: the database professional’s guide to exploiting indexes, views, storage, and

more (The Morgan Kaufmann Series in Data Management Systems). Morgan

Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.

[20] Vincent Y. Lum and Huei Ling. An optimization problem on the selection of

secondary keys. In Proceedings of the 1971 26th annual conference, pages 349–

356, New York, NY, USA, 1971. ACM.

[21] T. Malik, X. Wang, R. Burns, D. Dash, and Anastasia Ailamaki. Automated

physical design in database caches. In In Proceedings of SMDB 2008, 2008.

[22] Shamkant Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. Vertical par-

titioning algorithms for database design. ACM Trans. Database Syst., 9(4):680–

710, 1984.

[23] Stratos Papadomanolakis, Debabrata Dash, and Anastasia Ailamaki. Intelligent

use of query optimizer for automated physical design. Technical Report CMU-

CS-06-151, CMU, 2006.

[24] A. Rantzer. Dynamic programming via convex optimization, 1999.


[25] Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy Lohman. Automating physical

database design in a parallel database. In SIGMOD ’02: Proceedings of the 2002

ACM SIGMOD international conference on Management of data, pages 558–569,

New York, NY, USA, 2002. ACM.

[26] P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path

selection in a relational database management system. In SIGMOD 1979.

[27] Sandeep Tata, Lin Qiao, and Guy M. Lohman. On common tools for databases

- the case for a client-based index advisor. In ICDE Workshops, pages 42–49,

2008.

[28] http://dbtools.cs.cornell.edu/norm_index.html. The database normalization tool.

[29] http://www.oracle.com/technology/obe/11gr1_db/manage/sqlaccadv/sqlaccadv.%htm. Improving schema design with SQL Access Advisor.

[30] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 Design Advisor: integrated automatic physical database design. In VLDB, pages 1087–1097, 2004.

[31] Daniel C. Zilio, Calisto Zuzarte, Guy M. Lohman, Hamid Pirahesh, Jarek Gryz, Eric Alton, Dongming Liang, and Gary Valentin. Recommending materialized views and indexes with IBM DB2 Design Advisor. In ICAC '04: Proceedings of the First International Conference on Autonomic Computing, pages 180–188, Washington, DC, USA, 2004. IEEE Computer Society.
