Introduction to Databases
Data Organisation
Definition
Data modelling
SQL
DBMS functions
Basics of data Organisation:
DATA HIERARCHY (four categories)
• Field = represents a single data item
• Record = a related set of fields describing one instance of an entity
• File / Table = a set of related records - as many records as there are instances (occurrences) in the set
• Database = a collection of related files
Example of data structure
Name      First name   Telephone
Zidane    Zinedine     45 25 65 65
Feller    Joe          25 58 96 63
Clinton   Bill         12 25 28 89
Henry     Thierry      25 78 85 85

Each column is a field; each row is a record; the whole table is a file.
File / Table + other files => complete data structure = DB
"A collection of interrelated data stored together with controlled redundancy, to serve one or more applications in an optimal fashion; the data is stored so that it is independent of the application programs which use it; a common and controlled approach is used in adding new data and in modifying existing data within the database."
Database: Definition.
• A collection of interrelated data stored together• with controlled redundancy• to serve one or more applications in an optimal
fashion• the data is stored so that it is independent of
the application programs which use it• a common and controlled approach is used in
adding new data and in modifying existing data within the database.
Definition - closer look
Advantages of Databases:
• data are independent of applications - stored centrally
• data repository accessible to any new program
• data are not duplicated in different locations
• programmers do not have to write extensive descriptions of the files
• Physical and logical protection is centralised
Disadvantages of DBs:
• Centralisation can be a weakness
• Large DBs require expensive hardware and software
• specialised / scarce personnel are required to develop and maintain large DBs
• Standardisation of data on a central repository has implications for the format in which it is stored
Characteristics of DBs…
• High concurrency (high performance under load)
• Multi-user (read does not interfere with write)
• Data consistency - changes to data don't affect running queries + no phantom data changes
• High degree of recoverability ("pull the plug" test)
ACID test
• Atomicity - all or nothing
• Consistency - preserves the consistency of the database
• Isolation - transactions are independent
• Durability - once committed, data is preserved
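To make the ACID properties concrete, here is a minimal sketch of a transaction in SQL. Table and column names are hypothetical, and transaction syntax varies slightly between DBMSs (Oracle, for instance, starts a transaction implicitly):

    -- transfer 100 between two accounts: atomicity means both
    -- updates succeed or neither does
    BEGIN TRANSACTION;
    UPDATE account SET balance = balance - 100 WHERE account_no = 1001;
    UPDATE account SET balance = balance + 100 WHERE account_no = 1002;
    COMMIT;  -- durability: once committed, the change survives a crash
    -- on error, ROLLBACK instead restores the state before the transaction;
    -- other sessions never see the half-done state (isolation), and the
    -- total of all balances is unchanged (consistency)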
DataBase Management System (DBMS):
• a program that makes it possible to:
– create
– use (insert / update / delete data)
– maintain a database
• It provides an interface / translation mechanism between the logical organisation of the data stored in the DB and the physical organisation of the data
Using a database:
Two main functions of the DBMS :
• Query language - searching for answers in the data (SQL) - see the sketch after this list
• Data manipulation language - for programmers who want to modify the data model in which the data is stored
• + Host language - the language used by programmers to develop the rest of the application - e.g. Oracle Developer 2000
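As a sketch of the distinction, using the name / first name / telephone table from earlier (the table name "contact" and column names are assumptions):

    -- query language: searching for answers in the data
    SELECT name, first_name, telephone
    FROM   contact
    WHERE  name = 'Zidane';

    -- data manipulation: inserting and updating stored data
    INSERT INTO contact (name, first_name, telephone)
    VALUES ('Feller', 'Joe', '25 58 96 63');

    UPDATE contact SET telephone = '45 25 65 65' WHERE name = 'Zidane';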
Relational DBs:
• Data items stored in tables
• Specific fields in tables related to other fields in other tables (join - sketched below)
• Infinite number of possible viewpoints on the data (queries)
• Highly flexible DB but can be slow for complex searches
• Oracle, Sybase, Ingres, Access, Paradox for Windows...
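A sketch of a join, relating a field in one table to the matching field in another. Names are adapted from the customer and order tables that appear later in the deck; the order table is called sales_order here because ORDER is a reserved word in SQL:

    -- list each customer with their orders via the shared field
    SELECT c.customer_name, o.order_number, o.quantity_ordered
    FROM   customer c
    JOIN   sales_order o ON o.customer_number = c.customer_number;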
Describing relationships
• Attempt at modelling the business elements (entities) and their relationships (links)
• Can be based on users’ descriptions of the business processes
• Specifies dependencies between the data items
• Coded in an Entity-Relationship Diagram (ERD)
Types of Relationships
• one-to-one: one instance of one data item corresponds to one instance of another
• one-to-many: one instance corresponds to many instances
• many-to-many: many instances correspond to many instances (see the table sketch after this list)
• Also some relationships may be:
– compulsory
– optional
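A sketch of how these map onto tables, anticipating the student registration example below (all names hypothetical): a one-to-many relationship is a foreign key in the "many" table, while a many-to-many relationship needs a third, junction table:

    CREATE TABLE student (
      student_id INT PRIMARY KEY,
      name       VARCHAR(50)
    );
    CREATE TABLE course (
      course_id INT PRIMARY KEY,
      title     VARCHAR(50)
    );
    -- many-to-many between student and course, resolved by a junction
    -- table holding one row per (student, course) pair; each foreign key
    -- on its own expresses a one-to-many relationship
    CREATE TABLE registration (
      student_id INT REFERENCES student(student_id),
      course_id  INT REFERENCES course(course_id),
      PRIMARY KEY (student_id, course_id)
    );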
Example
• Student registration system
• What are the entities?
• What type of relationship do they have?
• Draw the diagram
Entity Relationship Diagram
Example 2 – Sales Order Processing
• Entities
• Relationships
• Use a business object based approach?
Next step - creating the data structure
• Few rules - a lot of experience
• Can get quite complex (paramount for the speed of the DB)
• Tables must be normalised - i.e. redundancy is limited to the strict minimum by an algorithm
• In practice, normalisation is not always the best
Data Structure Diagrams
• Describe the underlying structure of the DB: the complete logical structure
• Data items are stored in tables linked by pointers:
– attribute pointers: data fields in one table that link it to another (common information)
– logical pointers: specific links that exist between tables
• Tables have a key
• Is it an attribute or an entity?
[Diagram: three linked tables]
ORDER:    order number, item description, item price, quantity ordered, customer number, item number
ITEM:     item number, item description, item cost, quantity on hand
CUSTOMER: customer number, customer name, customer address, customer balance, customer special rate
Legend: * = compulsory attribute, 0 = optional attribute
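The diagram translates directly into table definitions. A sketch (data types are assumptions, and ORDER is renamed sales_order because ORDER is a reserved word):

    CREATE TABLE customer (
      customer_number       INT PRIMARY KEY,
      customer_name         VARCHAR(50),
      customer_address      VARCHAR(100),
      customer_balance      DECIMAL(10,2),
      customer_special_rate DECIMAL(5,2)
    );

    CREATE TABLE item (
      item_number      INT PRIMARY KEY,
      item_description VARCHAR(50),
      item_cost        DECIMAL(10,2),
      quantity_on_hand INT
    );

    CREATE TABLE sales_order (
      order_number     INT PRIMARY KEY,
      item_description VARCHAR(50),   -- duplicates item data; see normalisation
      item_price       DECIMAL(10,2),
      quantity_ordered INT,
      customer_number  INT REFERENCES customer(customer_number),  -- attribute pointer
      item_number      INT REFERENCES item(item_number)           -- attribute pointer
    );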
Normalisation
• Process of simplifying the relationships amongst data items as much as possible (see example provided - handout)
• Through an iterative process, structure of data is refined to 1NF, 2NF, 3NF etc.
• Reasons for normalisation:
– to simplify retrieval (speed of response)
– to simplify maintenance (updates, deletions, insertions)
– to reduce the need to restructure the data for each new application
First Normal Form
• design record structure so that each record looks the same (same length, no repeating groups)
• repetition within a record means one relation was missed = create new relation
• elements of repeating groups are stored as a separate entity, in a separate table
• normalised records have a fixed length and expanded primary key
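A sketch of the rule with hypothetical tables: an order record that carries a repeating group of items is split, and the repeating group is stored as a separate table under an expanded (composite) primary key:

    -- unnormalised: a repeating group of items inside each order record
    --   order(order_number, customer_number, item_1, qty_1, item_2, qty_2, ...)

    -- 1NF: fixed-length records; the repeating group becomes its own table
    CREATE TABLE order_line (
      order_number INT,
      item_number  INT,
      quantity     INT,
      PRIMARY KEY (order_number, item_number)   -- expanded primary key
    );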
Second Normal Form
• Record must be in first normal form first
• Each item in the record must be fully dependent on the key for identification
• Functional dependency means a data item's value is uniquely associated with another's
• Only one-to-one relationships between elements in the same file
• Otherwise split into more tables (sketch below)
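Continuing the sketch: in the order_line table above, an item_description column would depend on item_number alone - only part of the composite key - so it is split out (table names with suffixes are hypothetical, used to show before and after):

    -- in 1NF but not 2NF: item_description depends on part of the key
    CREATE TABLE order_line_1nf (
      order_number     INT,
      item_number      INT,
      item_description VARCHAR(50),   -- partial dependency on item_number
      quantity         INT,
      PRIMARY KEY (order_number, item_number)
    );

    -- 2NF: every non-key item is fully dependent on its table's key
    CREATE TABLE order_line_2nf (
      order_number INT,
      item_number  INT,
      quantity     INT,
      PRIMARY KEY (order_number, item_number)
    );
    CREATE TABLE item_2nf (
      item_number      INT PRIMARY KEY,
      item_description VARCHAR(50)
    );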
Third normal form
• To remove transitive dependencies (sketch below)
• When one item is dependent on an item which is dependent on the key of the file
• The relationship is split to avoid data being lost inadvertently
• This gives greater flexibility for the design of the application + eliminates deletion problems
• In practice, 3NF is not used all the time - speed of retrieval can be affected
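A sketch of a transitive dependency and its removal (names hypothetical): customer_address depends on customer_number, which in turn depends on the order key:

    -- in 2NF but not 3NF: customer_address is transitively
    -- dependent on order_number via customer_number
    CREATE TABLE sales_order_2nf (
      order_number     INT PRIMARY KEY,
      customer_number  INT,
      customer_address VARCHAR(100)   -- transitive dependency
    );

    -- 3NF: the transitively dependent item moves to its own table;
    -- deleting a customer's last order no longer loses the address
    CREATE TABLE sales_order_3nf (
      order_number    INT PRIMARY KEY,
      customer_number INT
    );
    CREATE TABLE customer_3nf (
      customer_number  INT PRIMARY KEY,
      customer_address VARCHAR(100)
    );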
Beyond data modelling
• Model must be normalised:
– optimised model
– "no surprise" model
– resilience
• Outcome is a set of tables = logical design
• Then, the design can be warped until it meets the realistic constraints of the system
• E.g. what business problem are we trying to solve? - see handout [Riccardi p. 113, 127]
Realistic constraints
• Users cannot cope with too many tables
• Too much development required in hiding complex data structure
• Too much administration
• Optimisation is impossible with too many tables
• Actually: RDBs can be quite slow!
Key practical questions
• What are the most important tasks that the DB MUST accomplish efficiently?
• How must the DB be rigged physically to address these?
• What coding practices will keep the coding clean and simple?
• What additional demands arise from the need for resilience and security?
Analysis - Three Levels of Schema
[Diagram: three levels of schema - several external schemas (user views) at the top, one logical schema of tables in the middle, and the internal schema at the bottom mapping the tables to physical storage (disk array)]
4-way trade-off
• Performance
• Clarity of code
• Ease of use
• Security
Key decisions
• Oracle offers many different ways to do things:
– indexes
– backups…
• Good analysis is not only about knowing these => understanding whether they are appropriate
• Failure to think it through => unworkable model
• Particularly, predicting performance must be done properly
– OK on the technical side, tricky on the business side
Design optimisation
• Sources of problems:
– network traffic
– excess CPU usage
• But physical I/O is the greatest threat (different from logical I/O)
• Disks are still the slowest link in the loop
• Solution: minimise or re-schedule access
• Also try to minimise the impact of Q4 (the fourth key question above - resilience and security; e.g. mirroring, internal consistency checks…)
Using scenarios for analysis
• Define standard situations for DB use
• Analyse their specific requirements
• Understand the implications for DB design
• Compare and contrast new problems with old ones
Categories of critical operations
• Manual transaction processing = complex data entry (DE) by a small number of operators
• Automatic transaction processing: large numbers of concurrent users performing simple DE
• High batch throughput: automatic batch input into the DB of a very large number of complex transactions
• Data warehousing: large volumes of new data added on top every day at fixed intervals + intensive querying
Manual transaction processing
• Insurance telemarketing broker:
– data entry
– retrieving reference info
– calculations
• On-line human-computer interaction!!
– instant validation (field by field)
– drop-down lists (DE accelerators)
• Quick response time
• Critical issue = user-friendly front end, but minimise traffic between interface and back end!
Automatic transaction processing
• Large number of users performing simple tasks
• Real-time credit card system (e.g. authorisation) or check-out (EPOS)
• Human interaction at its most simple - e.g. typing a code or swiping a card
• Minimum validation, no complex feedback…
• Large numbers mean potential problems are:
– connection opening / closing rate
– contention between concurrent users (see the sketch after this list)
– SQL engine problems + data consistency costs
• Design with multiple servers
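One common way to keep contention low in this setting is to make each transaction a single short statement, so rows are locked as briefly as possible. A sketch (table and column names hypothetical):

    -- authorise and reserve credit in one statement: the row is locked
    -- only for the duration of this update
    UPDATE card_account
    SET    available_credit = available_credit - 100
    WHERE  card_number      = '4929000100010001'
    AND    available_credit >= 100;
    -- one row updated => authorised; zero rows => declined
    COMMIT;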
Automatic transaction processing (cont.)
• Another e.g.: on-line shopping
• What specific problems would arise from shopping-cart type applications?
• How do you handle lost customers?
High batch throughput
• E.g. mobile phone network operator
• Real time + huge volume of simultaneous complex transactions:
– number checks
– account info
– price info
– pattern checks
• Large processing capacity required + need to tackle all transactions together in batches:
– DB query may not be the only solution (or the quickest)
– move customer account to cache
– copy updated figures for accounts to a log and update accounts in slack periods (2.5GB an hour!)
– indexing or partitioning for quicker access
“Data warehouse”• Huge store of data• Large volume added every day
– 99% new data, 1% corrections to existing data
• Substantial analysis required prior to development:– What to include– How to aggregate and organise it– Where data comes from
• Real Oracle territory because schedule is lax – ie not a real time application
• Key issues:– Getting partitioning right– Deciding how many summary levels– Deciding what to hold and what to recalulate
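A sketch of one summary level (hypothetical tables): detail rows are aggregated once, so routine queries read the small summary table instead of recalculating over the full store:

    -- daily totals precomputed from the detail data
    CREATE TABLE daily_sales_summary AS
    SELECT sale_date,
           item_number,
           SUM(quantity_sold) AS total_quantity,
           SUM(sale_amount)   AS total_amount
    FROM   sales_detail
    GROUP  BY sale_date, item_number;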
Partitioning
• Oldest trick in the book to speed up retrieval (e.g.?):
– smaller bunch of data
– well labelled so it can be easily found
– smaller index
• Data manipulation - maintenance, copying and protection far easier
• Break down a big problem (e.g. a table) into small ones - see the sketch below
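A sketch in Oracle-style syntax (names hypothetical), splitting a big table of call records by month so that each partition is a smaller, separately indexed and separately maintainable chunk:

    CREATE TABLE call_record (
      call_id   NUMBER,
      call_date DATE,
      duration  NUMBER
    )
    PARTITION BY RANGE (call_date) (
      PARTITION p_jan  VALUES LESS THAN (TO_DATE('2000-02-01','YYYY-MM-DD')),
      PARTITION p_feb  VALUES LESS THAN (TO_DATE('2000-03-01','YYYY-MM-DD')),
      PARTITION p_rest VALUES LESS THAN (MAXVALUE)
    );

    -- a query restricted to January only touches partition p_jan
    SELECT COUNT(*)
    FROM   call_record
    WHERE  call_date >= TO_DATE('2000-01-01','YYYY-MM-DD')
    AND    call_date <  TO_DATE('2000-02-01','YYYY-MM-DD');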
Internet Databases
• In between types 1 and 2:
– many concurrent sessions
– reduced interaction between front end and back end
– Internet = extra response time (2 secs!)
• In practice, many sites are quite slow
• Key issues:
– "thin client"
– reduced dialogue
– management of sessions (e.g. cookies) to avoid multiple restarts
Conclusion: Key issues
• At one end: very large numbers of small transactions
• Threat of network or process contention
• At the other end: small number of processes with complex data crunching and time constraints
• Design of DB and application must reflect these constraints