5/26/2018 Importance of Database Design


Global Business Services

2007 IBM Corporation

Importance of database design

Task management system for an insurance company as an example

Anne Lesell, Juha Puroranta



Agenda

Requirements & challenges

Logical data model and problems related to it

Resolution and the physical data model

Lessons learned

Recommendations





Requirements

Web-based task management system for an insurance company

~3,000 employees using the application daily

~1.2 million tasks inserted into the system per year

Avg. 8 requests/sec at peak hours

Avg. 2 requests/sec during normal load

Max. response time for a query: 1 second

Max. time for entering or modifying a task: 0.5 seconds

Dynamic SQL used so that various search criteria can be supported

Application: JSF-based web application on WAS v6

Database: DB2 for z/OS





Challenges

Three major functions with different search criteria, so the same indexes cannot serve all of them:

tasks of an individual employee

tasks related to an individual customer

tasks of an individual business unit

Several different views of the data in each function, so even more indexes are required

Customs of the company (e.g. dynamic SQL forbidden, variable-length and nullable fields not allowed), which called for reasoning and explaining the benefits





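Logical data model

Entities and the columns shown in the diagram:

TEHTÄVÄ (task): TEHTAVANO

KÄSITTELYTIETO (processing information): TEHTAVANO, LISAYSAIKA

SIIRTOVIESTI (transfer message): TEHTAVANO

MUIDEN_HOITAMAT (handled by others): OMISTAJA, TEHTAVANO, TAPAHTUMA_AIKA

SIIRTO (transfer): TEHTAVANO

HISTORIA (history): TEHTAVANO, AIKA

(TEHTAVANO = task number, LISAYSAIKA = insertion time, OMISTAJA = owner, TAPAHTUMA_AIKA = event time, AIKA = time)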





Problems with the logical data model

Too many indexes

Some of the indexes too big

The row size of the TEHTÄVÄ table was large: only 4 rows would have fit on one page, so full table scans would have taken a long time

Data is retrieved in chunks (28 rows per query = 2 web pages)

Calculating how many rows fulfill the search criteria would have required a separate SQL query, which in some cases could not have been answered from indexes alone

Response times would have been too slow
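
Chunked retrieval of this kind can be sketched as keyset pagination: each query asks for the next 28 rows after the last task id already shown, so an index on (owner, task id) can deliver the chunk in order without a sort. The sketch below is only an illustration under stated assumptions: it uses SQLite from the Python standard library as a stand-in for DB2, and the table `task`, its columns `task_id` and `owner_id`, the index name and the helper names are all hypothetical. `LIMIT` is SQLite syntax; DB2 expresses the same thing with `FETCH FIRST n ROWS ONLY`.

```python
import sqlite3

PAGE_SIZE = 28  # rows per query, as stated on the slide (= 2 web pages)


def fetch_chunk(conn, owner_id, after_task_id, page_size=PAGE_SIZE):
    """Return the next chunk of one owner's tasks, ordered by task id."""
    cur = conn.execute(
        "SELECT task_id, owner_id FROM task "
        "WHERE owner_id = ? AND task_id > ? "
        "ORDER BY task_id LIMIT ?",
        (owner_id, after_task_id, page_size),
    )
    return cur.fetchall()


def demo():
    """Build a small hypothetical table and fetch two consecutive chunks."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task (task_id INTEGER PRIMARY KEY, owner_id INTEGER)")
    # Equal-predicate column first, then the range/ordering column.
    conn.execute("CREATE INDEX ix_task_owner ON task (owner_id, task_id)")
    conn.executemany(
        "INSERT INTO task VALUES (?, ?)",
        [(i, i % 2) for i in range(1, 201)],  # owner 1 gets the odd task ids
    )
    first = fetch_chunk(conn, owner_id=1, after_task_id=0)
    # Continue from the last task id of the previous chunk.
    second = fetch_chunk(conn, owner_id=1, after_task_id=first[-1][0])
    conn.close()
    return first, second
```

Because each chunk starts from a known key, the application never has to count or skip the rows that came before it.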





Resolution

Splitting the TEHTÄVÄ table of the logical data model into three tables in the physical data model:

The search criteria for unfinished tasks differ from those for finished tasks

Data of the finished tasks does not change in the database, which allowed the data to be optimized (clustering index)

The description [varchar(1000)] of a task is needed on only one page, while the other data of the table is needed more frequently

Not all data used in the software is stored as-is in the database; some is derived from other information instead. E.g. the status of a task:

open task = row in the TEKEM table and owner id = null

task in progress = row in the TEKEM table and owner id != null

finished task = row in the TEHDYT table

Negotiating a reduction of the functional requirements:

limiting the search criteria: nice-to-have criteria were removed

calculating sums on only certain pages

Negotiating on the non-functional requirements:

allowing longer response times for reports provided to managers of business units
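
The derived task status described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the table names TEKEM and TEHDYT come from the deck, while the `TaskStatus` enum and the `derive_status` function are hypothetical names introduced here.

```python
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    DONE = "done"


def derive_status(table: str, owner_id: Optional[int]) -> TaskStatus:
    """Derive the status from where the row lives and whether it has an owner.

    No status column is stored; the rules follow the slide:
    TEKEM + no owner -> open, TEKEM + owner -> in progress, TEHDYT -> done.
    """
    if table == "TEHDYT":
        return TaskStatus.DONE
    if table == "TEKEM":
        return TaskStatus.OPEN if owner_id is None else TaskStatus.IN_PROGRESS
    raise ValueError(f"unknown task table: {table!r}")
```

Deriving the status this way means there is no status column to keep in sync, and so one column fewer to update and index.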





Logical vs. physical data model

Logical data model:

TEHTÄVÄ: TEHTAVANO

KÄSITTELYTIETO: TEHTAVANO, LISAYSAIKA

SIIRTOVIESTI: TEHTAVANO

MUIDEN_HOITAMAT: OMISTAJA, TEHTAVANO, TAPAHTUMA_AIKA

SIIRTO: TEHTAVANO

HISTORIA: TEHTAVANO, AIKA

Physical data model:

TEKEM: TEHTUN

KASTIETO: TEHTUN, LAIKA

SVIESTI: TEHTUN

MUIDHOI: TEHTUN, OMID

SIIRTO: TEHTUN

HISTORIA: HAIKAM

TEHDYT: TEHTUN

TKUVAUS: kuvausid





How indexes are identified

Rough index design principles

Index for primary key and foreign keys

Cluster order chosen so that processing big result sets uses sequential I/O, not random I/O

Aiming for three-star indexes:

* optimal matching columns (indexable predicates (z/OS) / range-delimiting predicates (LUW))

* avoid sorting

* no table access, index only

Column order in an index:

Start the index with the columns in equal predicates and IS NULL predicates, high-cardinality columns first (indexable / range delimiting + Boolean term)

Add the column in the most selective range predicate (indexable, stage 1 / range delimiting, index sargable, Boolean term) for index screening

Add the remaining columns so that ORDER BY / GROUP BY will not result in a sort
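
The column-ordering rules above can be sketched as a small helper. This is a rough illustration under assumptions, not a tool from the project: the `Predicate` class, the `order_index_columns` function, and the cardinality and selectivity figures in the usage note are all hypothetical, and real index design also has to weigh index size and update cost.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Predicate:
    column: str
    kind: str                  # "equal", "is_null" or "range"
    cardinality: int = 0       # distinct values; used to order equal predicates
    selectivity: float = 1.0   # fraction of rows matched; lower = more selective


def order_index_columns(predicates: Sequence[Predicate],
                        order_by: Sequence[str] = ()) -> List[str]:
    """Order candidate index columns following the three rules on the slide."""
    columns: List[str] = []

    def add(column: str) -> None:
        if column not in columns:
            columns.append(column)

    # 1. Equal and IS NULL predicates first, high-cardinality columns first.
    matching = [p for p in predicates if p.kind in ("equal", "is_null")]
    for p in sorted(matching, key=lambda p: -p.cardinality):
        add(p.column)

    # 2. Then the column of the most selective range predicate.
    ranges = [p for p in predicates if p.kind == "range"]
    if ranges:
        add(min(ranges, key=lambda p: p.selectivity).column)

    # 3. Then the remaining columns: ORDER BY columns first, so that the
    #    index order can satisfy the sort, followed by the other range columns.
    for column in order_by:
        add(column)
    for p in ranges:
        add(p.column)
    return columns
```

For example, with hypothetical predicates on `owner_id` (equal, cardinality 3000), `task_type` (equal, cardinality 20), `task_id` (range, selectivity 0.1) and `due_date` (range, selectivity 0.5), and `ORDER BY task_id`, the helper returns `owner_id, task_type, task_id, due_date`.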





How response times are calculated

VQUBE for DB2 for z/OS (very quick upper bound estimation)

The formula depends on the version of DB2, the model of the computer and the disk workload

We estimate the worst case; the average only if it bears any meaning

We aim to avoid negative surprises in response times in real production

The formula is only for I/O-bound SQL; a different formula applies to CPU-bound queries

The formula can also be used for DB2 for LUW, where it is often too pessimistic

VQUBE for DB2 for z/OS on a z990 with a processor of more than 400 MIPS:

LocalResponseTime (LRT) = TR x 10 ms + TS x 0.2 ms + sort rows x 0.002 ms

where TR = number of random touches, TS = number of sequential touches, sort rows = number of rows sorted
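
The formula is easy to turn into a small function. The sketch below takes the three coefficients from the slide as defaults (10 ms, 0.2 ms, 0.002 ms); the function and parameter names are hypothetical, and the coefficients would need adjusting for another DB2 version, machine or disk workload, as the slide notes.

```python
def local_response_time_ms(random_touches: int,
                           sequential_touches: int,
                           sorted_rows: int,
                           random_ms: float = 10.0,
                           sequential_ms: float = 0.2,
                           sort_ms: float = 0.002) -> float:
    """Upper-bound estimate of the local response time of I/O-bound SQL."""
    return (random_touches * random_ms
            + sequential_touches * sequential_ms
            + sorted_rows * sort_ms)
```

With made-up inputs of 20 random touches, 1000 sequential touches and 1000 sorted rows, the estimate is about 402 ms (200 + 200 + 2), which shows how quickly random touches dominate the total.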





Example of how the calculation is done

============================================

30 Task searches - own tasks (in progress)

============================================

SELECT t.tmvipu, t.ktmvipu, t.onvipu, t.tehtun, t.tehtyyppi,
       t.kampno, t.tehot, t.asno, t.versio, t.tpvm, t.saika,
       t.lapvm, t.tprty, t.omid, t.omorgy, t.kasryh, t.aikam
FROM tekem t
WHERE t.omid = ?
AND t.tehtyyppi IN (?,?,...)
AND t.tehtun > ?
AND t.apvm , apvm

one person has 200 tasks in progress

rows satisfying the tehtun > predicate = 200

        TR    TS    sort
t3       1     3
T      200

LRT = 201 TR x 10 + 3 TS x 0.1 + sort rows x 0.01 = 2010 ms

if only the first 50 rows are fetched, which is possible because there is no sort:

        TR    TS    sort
t3       1     3
T       50

LRT = 51 TR x 10 + 3 TS x 0.1 + sort rows x 0.01 = 510 ms
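
The two results can be reproduced, and compared with the 1-second query limit from the requirements, with a short script. This is a sketch under assumptions: it uses the coefficients that appear in this example (10, 0.1 and 0.01 ms), it reads the `t3` row as 1 random and 3 sequential touches and the `T` row as one random touch per fetched row, and all function and constant names are hypothetical.

```python
RANDOM_MS = 10.0       # cost per random touch used in this example
SEQUENTIAL_MS = 0.1    # cost per sequential touch used in this example
SORT_MS = 0.01         # cost per sorted row used in this example
BUDGET_MS = 1000.0     # max. response time for a query (1 second)


def example_lrt_ms(rows_fetched: int,
                   index_random: int = 1,
                   index_sequential: int = 3,
                   sorted_rows: int = 0) -> float:
    """Estimate for the example: one random touch per fetched table row."""
    random_touches = index_random + rows_fetched
    return (random_touches * RANDOM_MS
            + index_sequential * SEQUENTIAL_MS
            + sorted_rows * SORT_MS)


def max_rows_within_budget(budget_ms: float = BUDGET_MS) -> int:
    """Largest number of fetched rows whose estimate stays within the budget."""
    rows = 0
    while example_lrt_ms(rows + 1) <= budget_ms:
        rows += 1
    return rows


if __name__ == "__main__":
    for rows in (200, 50):
        estimate = example_lrt_ms(rows)
        verdict = "within" if estimate <= BUDGET_MS else "over"
        print(f"{rows} rows: about {estimate:.0f} ms ({verdict} the 1 s limit)")
    print("rows that fit in the budget:", max_rows_within_budget())
```

Under these assumptions, fetching all 200 rows exceeds the limit while fetching 50 rows stays well inside it, and the estimate crosses 1 second at just under 100 fetched rows.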





Lessons learned

Involve a database specialist in the project during the analysis and design phase (include this already in the project plan).

At the testing phase it might be too late, and fixing problems then might require a lot of refactoring of code and database.

In the Teha project no modifications to the database or software were needed after the performance test; only an accidentally missing index had to be added. The actual response times were almost identical to the expected response times.

Anne spent around 50 hours calculating the expected response times based on the logical data model, refactoring, and recalculating based on the new model.

About 15 man-days would probably have been needed for the modifications alone if the problems had been identified only during performance testing. In addition, the performance tests would have had to be redone, requiring resources from both IBM and the customer.

Calculate the expected response times of the queries before design and especially prior to implementation of the data access layer.





Recommendations

Gather what kinds of queries need to be run against the database and what the response time requirements are for each of them.

Convert the queries into SQL statements and calculate the expected response time for each of them.

Verify that the response times are satisfactory. Keep in mind that often more than one SQL query needs to be run against the database for a single HTTP request.

Gather the search and sorting criteria.

The search criteria for the different queries should match on at least a couple of columns. In addition, a couple of other search criteria based on other columns can be used.

The results should be sorted by the search criteria, not by additional columns that are not used as search criteria.

Use caching of data in the presentation layer when possible instead of doing table joins.

E.g. customer name vs. customer id, organizational unit name vs. org unit id
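
The caching recommendation can be sketched as an in-memory id-to-name lookup in the presentation layer. Everything in this example is hypothetical (the `NameCache` class, the `decorate_tasks` helper, the `customer_id` field and the loader function); it only illustrates resolving names from a cache instead of joining to the customer table in every task query, and it leaves out cache expiry.

```python
from typing import Callable, Dict, Iterable, List


class NameCache:
    """Caches id -> name lookups so each id is loaded at most once."""

    def __init__(self, loader: Callable[[int], str]) -> None:
        self._loader = loader            # e.g. a single-row lookup by id
        self._names: Dict[int, str] = {}
        self.loads = 0                   # number of times the loader was called

    def name_for(self, key: int) -> str:
        if key not in self._names:
            self._names[key] = self._loader(key)
            self.loads += 1
        return self._names[key]


def decorate_tasks(tasks: Iterable[dict], customers: NameCache) -> List[dict]:
    """Add the customer name to each task row without a table join."""
    return [dict(task, customer_name=customers.name_for(task["customer_id"]))
            for task in tasks]
```

The task query then only has to return the customer id, which keeps it on a single table and makes an index-only access path more likely.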





Recommendations

Gather the characteristics of the database tables and their columns

Is the data stored in the table permanent or can it be altered?

How many rows will be stored in the table?

What is the expected growth rate of the rows in the table, or does the row count remain constant or within certain boundaries? Will the data be entered in chunks or at a constant rate?

In which columns does the data change, and how frequently, and in which columns does it not change after it has been entered? E.g. the task id never changes, the classification of the task changes rarely, the assigned employee changes frequently.

This allows the order of the columns in the table to be optimized so that writes to the rollback logs are minimal.

What is the range of possible values in each column, and are those values evenly used?

If the range of values is small and the table is huge, the column might not be a good candidate for an index.

If the range of values is big, but only 1 or 2 of the values are used in 98% of the rows, the column might not be a good candidate for an index.

Which of the columns are nullable, and what proportion of the rows contain a null value in that column?

Depending on the queries run against the database, a nullable column is normally not a good column to use in an index.
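
The value-distribution questions above can be checked against a sample of column values. The sketch below is illustrative only: the function names and the thresholds (at least 10 distinct values, at most 90% of rows in the two most common values, at most 50% nulls) are arbitrary assumptions, not figures from the deck, and a real decision also depends on the queries involved.

```python
from collections import Counter
from typing import Dict, Sequence


def column_profile(values: Sequence[object]) -> Dict[str, float]:
    """Summarize distinct values, null share and skew of one column sample."""
    total = len(values)
    if total == 0:
        raise ValueError("no rows sampled")
    counts = Counter(v for v in values if v is not None)
    nulls = total - sum(counts.values())
    top_two = sum(n for _, n in counts.most_common(2))
    return {
        "distinct": len(counts),
        "null_share": nulls / total,
        "top_two_share": top_two / total,  # share of rows in the 2 commonest values
    }


def poor_index_candidate(profile: Dict[str, float],
                         min_distinct: int = 10,
                         max_top_two_share: float = 0.9,
                         max_null_share: float = 0.5) -> bool:
    """Flag columns with few values, heavy skew or mostly nulls (assumed limits)."""
    return (profile["distinct"] < min_distinct
            or profile["top_two_share"] > max_top_two_share
            or profile["null_share"] > max_null_share)
```

A column where two values cover 98% of the rows is flagged even if it has many distinct values, matching the skew case described above.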
