importance of database design

  • Importance of database design

    Global Business Services

    © 2007 IBM Corporation

    Task management system for an insurance company as an example

    Anne Lesell, Juha Puroranta

  • Agenda

    Requirements & challenges

    Logical datamodel and problems related to it

    Resolution and the physical datamodel

    Lessons learned

    Recommendations

  • Requirements

    Web-based task management system for an insurance company

    ~3000 employees using the application daily

    ~1.2M tasks inserted into the system per year

    Avg. 8 req/sec at peak hours

    Avg. 2 req/sec during normal load

    Max. response time for a query: 1 second

    Max. time for entry and modification of a task: 0.5 seconds

    Dynamic SQL is used in order to allow various search criteria.

    Application: JSF-based web application on WAS v6

    Database: DB2 for z/OS

  • Challenges

    Three major functions with different search criteria, so the same indexes cannot be used

      tasks of an individual employee

      tasks related to an individual customer

      tasks of an individual business unit

    Several different views to the data in each function, so even more indexes are required

    Customs of the company (e.g. dynamic SQL forbidden, variable-length and nullable fields not allowed), which required reasoning and explaining the benefits

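  • Logical datamodel

    Tables and the columns shown in the diagram (Finnish identifiers, English gloss in parentheses)

      TEHTÄVÄ (task): TEHTAVANO (task number)

      KÄSITTELYTIETO (processing information): TEHTAVANO, LISAYSAIKA (insertion time)

      SIIRTOVIESTI (transfer message): TEHTAVANO

      MUIDEN_HOITAMAT (handled by others): OMISTAJA (owner), TEHTAVANO, TAPAHTUMA_AIKA (event time)

      SIIRTO (transfer): TEHTAVANO

      HISTORIA (history): TEHTAVANO, AIKA (time)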

  • Problems with the logical data model

    Too many indexes

    Some of the indexes too big

    Row size of the TEHTÄVÄ table was big

      only 4 rows would have fit into one page, so full table scans would have taken a lot of time

    Retrieving data in chunks (28 rows per query = 2 web pages)

      Calculating how many rows fulfill the search criteria would have required a separate SQL query, which in some cases could not have been done using only indexes

    Response times would have been too slow
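To put the full-table-scan concern in rough numbers, the sketch below combines figures stated elsewhere in this deck: 4 rows per page, ~1.2M tasks per year, and the 0.2 ms sequential-touch cost from the VQUBE formula. It assumes, purely for illustration, that the table holds exactly one year of tasks. The function name `full_scan_estimate` is hypothetical, and the result is an upper-bound style estimate, not a measurement.

```python
import math

ROWS_PER_PAGE = 4            # from the slide: only 4 rows fit into one page
SEQ_TOUCH_MS = 0.2           # sequential touch cost taken from the VQUBE slide
TASKS_PER_YEAR = 1_200_000   # from the requirements slide


def full_scan_estimate(rows: int) -> tuple[int, float]:
    """Return (pages read, estimated scan time in seconds) for a full table scan."""
    pages = math.ceil(rows / ROWS_PER_PAGE)
    seconds = rows * SEQ_TOUCH_MS / 1000.0
    return pages, seconds


if __name__ == "__main__":
    # Assumption: the table contains one year of tasks.
    pages, seconds = full_scan_estimate(TASKS_PER_YEAR)
    print(f"{pages} pages, roughly {seconds:.0f} s")
```

Under these assumptions a single scan touches 300,000 pages and takes minutes rather than the 1 second allowed for a query.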

  • Resolution

    Splitting the TEHTÄVÄ table in the logical datamodel into three tables in the physical datamodel

      The search criteria for the unfinished tasks are different from those for the finished tasks

      Data of the finished tasks does not change in the database, which allowed optimization of the data (clustering index)

      The description [varchar(1000)] of the task is needed only on one page, while the other data of the table is needed more frequently.

    Not all data used in the software is stored as-is in the database; some of it is derived from other information instead. E.g. status of a task:

      open task = tekem table & owner id (omistaja id) = null

      task in processing = tekem table & owner id (omistaja id) != null

      finished task = tehdyt table

    Negotiating a reduction of functional requirements

      limiting search criteria: nice-to-have criteria were removed

      calculating sums on only certain pages

    Negotiating on non-functional requirements

      allowing longer response times for reports provided to managers of business units
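The derived task status described on this slide can be sketched as follows. This is a minimal illustration, not the project's code: the `TaskRow` class, the `derive_status` function and the English status labels are assumptions; only the table names `tekem` and `tehdyt` and the owner-id rule come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRow:
    table: str                # physical table the row was read from: "tekem" or "tehdyt"
    owner_id: Optional[str]   # owner id column; None models SQL NULL


def derive_status(row: TaskRow) -> str:
    """Derive the status from the table the row lives in and whether it has an owner."""
    if row.table == "tehdyt":
        return "finished"
    if row.table == "tekem":
        return "open" if row.owner_id is None else "in processing"
    raise ValueError(f"unexpected table: {row.table}")


if __name__ == "__main__":
    for row in (TaskRow("tekem", None), TaskRow("tekem", "E123"), TaskRow("tehdyt", "E123")):
        print(row, derive_status(row))
```

Because the status is never stored, no status column has to be updated or indexed when a task changes hands or is completed.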

  • Logical vs. physical datamodel

    Logical datamodel (table: columns shown in the diagram)

      TEHTÄVÄ: TEHTAVANO

      KÄSITTELYTIETO: TEHTAVANO, LISAYSAIKA

      SIIRTOVIESTI: TEHTAVANO

      MUIDEN_HOITAMAT: OMISTAJA, TEHTAVANO, TAPAHTUMA_AIKA

      SIIRTO: TEHTAVANO

      HISTORIA: TEHTAVANO, AIKA

    Physical datamodel (table: columns shown in the diagram)

      TEKEM: TEHTUN

      KASTIETO: TEHTUN, LAIKA

      SVIESTI: TEHTUN

      MUIDHOI: TEHTUN, OMID

      SIIRTO: TEHTUN

      HISTORIA: HAIKAM

      TEHDYT: TEHTUN

      TKUVAUS: kuvausid

  • How indexes are identified

    Rough index design principles

      Index for primary key and foreign keys

      Cluster order such that processing big result sets will use sequential I/O, not random

      Aiming for three-star indexes:

        * optimal matching columns (indexable predicates (z/OS) / range delimiting (LUW))

        * avoid sorting

        * no table access, index only

    Column order in index:

      Start the index with columns in equal predicates and IS NULL predicates, high-cardinality columns first (indexable / range delimiting + boolean)

      Add the column in the most selective range predicate (indexable, stage-1 / range delimiting, index-sargable, boolean term) for index screening

      Add the remaining columns, so that ORDER BY / GROUP BY will not result in a sort
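The column-order rules above can be sketched as a small function. This is only an illustration under stated assumptions: the `Predicate` class, `index_column_order`, and the `cardinality` and `filter_factor` statistics are hypothetical, and treating "remaining columns" as ORDER BY / GROUP BY columns followed by leftover predicate columns is one reading of the slide, not a rule stated in it.

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    column: str
    kind: str             # "equal", "is_null" or "range"
    cardinality: int      # distinct values in the column (hypothetical statistic)
    filter_factor: float  # fraction of rows the predicate keeps; smaller = more selective


def index_column_order(predicates: list[Predicate], order_by: list[str]) -> list[str]:
    """Order index columns following the three steps listed on the slide."""
    equal = [p for p in predicates if p.kind in ("equal", "is_null")]
    ranges = [p for p in predicates if p.kind == "range"]

    # 1. Equal / IS NULL predicate columns, high-cardinality columns first.
    columns = [p.column for p in sorted(equal, key=lambda p: -p.cardinality)]

    # 2. The column of the most selective range predicate.
    if ranges:
        best = min(ranges, key=lambda p: p.filter_factor)
        if best.column not in columns:
            columns.append(best.column)

    # 3. Remaining columns: ORDER BY / GROUP BY columns, then leftover predicate columns.
    for name in order_by + [p.column for p in ranges]:
        if name not in columns:
            columns.append(name)
    return columns


if __name__ == "__main__":
    # Column names borrowed from the deck's example query; the statistics are invented.
    preds = [
        Predicate("tehtyyppi", "equal", 20, 0.1),
        Predicate("omid", "equal", 3000, 0.001),
        Predicate("tehtun", "range", 1_000_000, 0.5),
        Predicate("apvm", "range", 365, 0.2),
    ]
    print(index_column_order(preds, order_by=["tehtun"]))
```

With these invented statistics the suggested order is `omid`, `tehtyyppi`, `apvm`, `tehtun`. Whether that index actually avoids a sort still has to be checked against the optimizer's access path.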

  • How response times are calculated

    VQUBE for DB2 for z/OS (very quick upper bound estimation)

    Formula depends on the version of DB2, the computer model and the disk workload

    We estimate the worst case, and the average only if it bears any meaning

    We aim to avoid negative surprises in response times in true production

    Formula is only for I/O-bound SQL; a different formula is used for CPU-bound queries

    Formula can be used for DB2 for LUW, where it is often too pessimistic

    VQUBE for DB2 for z/OS, z990, more than 400 MIPS processor

    LocalResponseTime = #TR (random touches) x 10 ms + #TS (sequential touches) x 0.2 ms + #sort rows x 0.002 ms
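The formula on this slide translates directly into a small helper. The sketch below uses the three coefficients exactly as given here (10 ms, 0.2 ms, 0.002 ms). The function name and the sample touch counts are hypothetical and are not taken from the project.

```python
RANDOM_TOUCH_MS = 10.0      # cost per random touch, from the slide
SEQUENTIAL_TOUCH_MS = 0.2   # cost per sequential touch, from the slide
SORT_ROW_MS = 0.002         # cost per sorted row, from the slide


def local_response_time_ms(random_touches: int,
                           sequential_touches: int,
                           sorted_rows: int = 0) -> float:
    """Estimate the local response time of an I/O-bound query in milliseconds."""
    return (random_touches * RANDOM_TOUCH_MS
            + sequential_touches * SEQUENTIAL_TOUCH_MS
            + sorted_rows * SORT_ROW_MS)


if __name__ == "__main__":
    # Hypothetical query: 20 random touches, 1000 sequential touches, 1000 sorted rows.
    estimate = local_response_time_ms(20, 1000, 1000)
    print(f"estimated LRT: {estimate:.1f} ms")
```

Random touches dominate the estimate: in the hypothetical case above, 20 random touches cost as much as 1000 sequential ones.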

  • Example how calculation is done

    ============================================

    30 Task searches - own tasks (in processing)

    ============================================

    SELECT t.tmvipu, t.ktmvipu, t.onvipu, t.tehtun, t.tehtyyppi,
           t.kampno, t.tehot, t.asno, t.versio, t.tpvm, t.saika,
           t.lapvm, t.tprty, t.omid, t.omorgy, t.kasryh, t.aikam
    FROM tekem t
    WHERE t.omid = ?
    AND t.tehtyyppi IN (?,?,...)
    AND t.tehtun > ?
    AND t.apvm , apvm

    tasks in processing for one employee = 200

    rows satisfying tehtun > ? = 200

              TR    TS    sort
    t3         1     3
    T        200

    LRT = 201 TR x 10 ms + 3 TS x 0.1 ms + sort rows x 0.01 ms ≈ 2010 ms

    If only the first 50 rows are fetched, which is possible because there is no sort:

              TR    TS    sort
    t3         1     3
    T         50

    LRT = 51 TR x 10 ms + 3 TS x 0.1 ms + sort rows x 0.01 ms ≈ 510 ms
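The two results on this slide can be reproduced, and turned around to ask how many rows fit into a response-time budget. The sketch below uses the coefficients as written in this example (10 ms per random touch, 0.1 ms per sequential touch) and the touch counts from the table (1 random and 3 sequential touches for `t3`, one random touch per fetched row for `T`). The function names are hypothetical; the 1-second budget and the 28-row chunk size come from the requirements and problems slides.

```python
RANDOM_TOUCH_MS = 10.0        # per random touch, as written in this example
SEQUENTIAL_TOUCH_MS = 0.1     # per sequential touch, as written in this example
INDEX_RANDOM_TOUCHES = 1      # row "t3" of the touch table
INDEX_SEQUENTIAL_TOUCHES = 3  # row "t3" of the touch table


def lrt_ms(rows_fetched: int) -> float:
    """LRT when every fetched row costs one random table touch and nothing is sorted."""
    random_touches = INDEX_RANDOM_TOUCHES + rows_fetched
    return (random_touches * RANDOM_TOUCH_MS
            + INDEX_SEQUENTIAL_TOUCHES * SEQUENTIAL_TOUCH_MS)


def max_rows_within(budget_ms: float) -> int:
    """Largest number of fetched rows whose estimated LRT stays within the budget."""
    fixed = (INDEX_RANDOM_TOUCHES * RANDOM_TOUCH_MS
             + INDEX_SEQUENTIAL_TOUCHES * SEQUENTIAL_TOUCH_MS)
    return int((budget_ms - fixed) // RANDOM_TOUCH_MS)


if __name__ == "__main__":
    for rows in (200, 50, 28):
        print(f"{rows} rows: about {lrt_ms(rows):.0f} ms")
    print("rows within 1000 ms:", max_rows_within(1000))
```

Under these assumptions fetching all 200 rows misses the 1-second requirement, while the 28-row chunk used by the application stays well inside it.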

  • Lessons learned

    Involve a database specialist in the project during the analysis and design phase (include this already in the project plan).

      At the testing phase it might be too late, and fixing problems might require a lot of refactoring of code and database.

      In the Teha project no modifications to the database or software were needed after the performance test; only an accidentally missing index had to be added. The actual response times were almost identical to the expected response times.

      Anne spent around 50 h calculating the expected response times based on the logical datamodel, refactoring, and recalculating based on the new model.

      About 15 man-days would probably have been needed for the modifications alone if the problems had been identified only during performance testing. In addition, performance tests would have had to be redone, requiring resources from both IBM and the customer.

    Calculate the expected response times of the queries before design and especially prior to implementation of the data access layer.

  • Recommendations

    Gather what kinds of queries need to be run against the database and what the response time requirements are for each of them.

      Convert the queries into SQL statements and calculate the expected response times for each of them

      Verify that the response times are satisfactory. Keep in mind that often more than one SQL query needs to be run against the database for a single HTTP request.

    Gather the search and sorting criteria

      The search criteria for the different queries should match on at least a couple of columns. In addition, a couple of other search criteria based on other columns could be used.

      The results should be sorted by columns used as search criteria, not by additional columns that are not used as search criteria.

    Use caching of data in the presentation layer when possible instead of doing table joins

      E.g. customer name vs. customer id, organizational unit name vs. org unit id
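The caching recommendation can be illustrated with a minimal sketch: instead of joining every task row to a customer table, the presentation layer resolves customer ids to names through a cache. Everything here is hypothetical (the sample data, `fetch_customer_name_from_db`, `customer_name`, `decorate_tasks`). The project itself was a JSF application, so this Python version only shows the idea.

```python
from functools import lru_cache

# Hypothetical reference data standing in for a customer table.
_CUSTOMER_NAMES = {1001: "Example Customer Oy", 1002: "Another Customer Ab"}


def fetch_customer_name_from_db(customer_id: int) -> str:
    """Placeholder for a single-row lookup by primary key."""
    return _CUSTOMER_NAMES[customer_id]


@lru_cache(maxsize=10_000)
def customer_name(customer_id: int) -> str:
    """Cached id-to-name lookup; the database is asked once per distinct id."""
    return fetch_customer_name_from_db(customer_id)


def decorate_tasks(task_rows: list[dict]) -> list[dict]:
    """Attach customer names to task rows that were fetched without a join."""
    return [{**row, "customer_name": customer_name(row["customer_id"])}
            for row in task_rows]


if __name__ == "__main__":
    rows = [{"task_id": 1, "customer_id": 1001},
            {"task_id": 2, "customer_id": 1001},
            {"task_id": 3, "customer_id": 1002}]
    print(decorate_tasks(rows))
    print(customer_name.cache_info())
```

This fits data such as customer or organizational unit names, which change rarely. A real cache would also need an invalidation or expiry policy.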

  • Recommendations

    Gather characteristics of the database tables and their columns

      Is the data stored in the table permanent or can it be altered?

      How many rows will be stored in the table?

      What is the expected growth rate of the rows in the table, or does the row count remain constant or within certain boundaries? Will the data be entered in chunks or at a constant rate?

      In which columns does the data change, how frequently, and in which columns does it not change after it has been entered? E.g. task id never changes, classification of the task changes rarely, assigned employee changes frequently

        This allows the order of the columns in the table to be optimized so that writes to the rollback logs are minimal

      What is the range of possible values in each column and are those values evenly used?

        If the range of values is small and the table is huge, the column might not be a good candidate for an index

        If the range of values is big, but only 1 or 2 of them are used in 98% of the rows, the column might not be a good candidate for an index

      Which of the columns are nullable and what proportion of the rows contain a null value in that column?

        Depending on the queries run against the database, such a column is normally not a good candidate for an index
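The column questions above can be turned into a rough screening helper that runs over a sample of column values. This is a sketch under loud assumptions: the function name, the warning texts, and the `min_distinct` and `max_null_share` thresholds are invented for illustration. Only the 98% figure comes from the slide. A warning is a prompt to look closer, not a verdict.

```python
from collections import Counter
from typing import Optional, Sequence


def index_candidate_warnings(values: Sequence[Optional[str]],
                             min_distinct: int = 10,         # hypothetical threshold
                             max_top2_share: float = 0.98,   # the 98% from the slide
                             max_null_share: float = 0.5     # hypothetical threshold
                             ) -> list[str]:
    """Return reasons why a column might be a poor index candidate."""
    total = len(values)
    if total == 0:
        return ["no rows sampled"]

    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    warnings = []

    # Small range of values in a large table.
    if len(counts) < min_distinct:
        warnings.append("small range of values")

    # Big range, but one or two values cover almost all rows.
    top2 = sum(n for _, n in counts.most_common(2))
    if len(counts) > 2 and top2 / total >= max_top2_share:
        warnings.append("values unevenly used")

    # Large proportion of nulls.
    if (total - len(non_null)) / total >= max_null_share:
        warnings.append("mostly null")

    return warnings
```

For example, a column where two values cover 99% of 1000 sampled rows yields "values unevenly used", while a yes/no flag yields "small range of values".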