
Database Management System

MSc. Information Technology, Semester I

Amity University


Database management systems are a primary ingredient of modern computing systems. Although

database concepts, technology and architectures have been developed and consolidated in the last

three decades, many aspects are subject to technological evolution and revolution. Thus,

developing a study material on this classical and yet continuously evolving field is a great

challenge.

Key features

This study material provides a comprehensive treatment of databases, dealing with the complete

syllabus for both an introductory course and an advanced course on databases. It offers a

balanced view of concepts, languages and architectures, with concrete reference to current

technology and to commercial database management systems (DBMS). It originates from the

authors' experience in teaching both UG and PG classes, in theory and application.

The study material is composed of seven chapters. Chapters 1 and 2 are designed to expose

students to the fundamental principles of database management and RDBMS concepts. They give

an idea of how to design a database and develop its schema. Discussion of design techniques

starts with the introduction of the elements of the E-R (Entity-Relationship) model and proceeds

through a well-defined, staged process through conceptual design to the logical design, which

produces a relational schema.

Chapters 3 and 4 are devoted to advanced concepts, including Normalization, Functional

Dependency, and the use of Structured Query Language, required for mastering database technology.

Chapter 5 describes the fundamental and advanced concepts of the procedural query language

commonly known as PL/SQL. It extends the power of Structured Query Language. PL/SQL

technology is like an engine that executes PL/SQL blocks and subprograms. This engine can be

started in the Oracle server or in application development tools such as Oracle Forms, Oracle

Reports etc.


Chapters 6 and 7 focus on advanced concepts of database systems, including

Transaction Management, Concurrency Control techniques, and Backup and

Recovery methods for database systems.


Updated Syllabus

Course Contents:

Module I: Introduction to DBMS

Introduction to DBMS, Architecture of DBMS, Components of DBMS, Traditional data Models

(Network, Hierarchical and Relational), Database Users, Database Languages, Schemas and

Instances, Data Independence

Module II: Data Modeling

Entity sets attributes and keys, Relationships (ER), Database modeling using entity, Weak and

Strong entity types, Enhanced entity-relationship (EER), Entity Relationship Diagram Design of

an E-R Database schema, Object modeling, Specialization and generalization

Module III: Relational Database Model

Basic Definitions, Properties of Relational Model, Keys, Constraints, Integrity rules, Relational

Algebra, Relational Calculus.

Module IV: Relational Database Design

Functional Dependencies, Normalization, Normal forms (1st, 2nd, 3rd, BCNF), Lossless

decomposition, Join dependencies, 4th & 5th Normal form.

Module V: Query Language

SQL Components (DDL, DML, DCL), SQL Constructs (SELECT ... FROM ... WHERE ...

GROUP BY ... HAVING ... ORDER BY ...), Nested tables, Views, correlated queries, Objects in Oracle.

Module VI: PL/SQL

Introduction, Basic block, Structure of PL/SQL program, Control Statements, Exception

handling, Cursor Concept, Procedure, functions and triggers.

Module VII: Database Security and Authorization


Basic security issues, Discretionary access control, Mandatory access control, Statistical

database security.

Module VIII: Transaction Management and Concurrency Control Techniques

Transaction concept, ACID properties, Schedules and recoverability, Serial and Non-serial

schedules, Serializability, Concurrency Techniques: Locking Protocols, Timestamping Protocol,

Multiversion Technique, Deadlock Concept - detection and resolution.

Module IX: Backup and Recovery

Database recovery techniques based on immediate and deferred update, ARIES recovery

algorithm, Shadow pages and Write-ahead Logging

Text & References:

Text:

Fundamentals of Database Systems, Elmasri & Navathe, Pearson Education, Asia

Database Management Systems, Leon & Leon, Vikas Publications

Database System Concepts, Korth & Sudarshan, TMH

References:

Introduction to Database Systems, Bipin C Desai, Galgotia

Oracle 9i The Complete Reference, Oracle Press


Index:

1. Introduction to DBMS and Data Modeling

2. Relational Database Model

3. Functional Dependency and Normalization

4. Structured Query Language

5. Procedural Query Language

6. Transaction Management & Concurrency Control Techniques

7. Database Recovery, Backup & Security


Chapter-1

INTRODUCTION TO DBMS AND DATA MODELING

1. Introductory Concepts

Data: Data is a collection of facts upon which a conclusion is based. (Information or

knowledge has value; data has cost.) Data can be represented in terms of numbers, characters,

pictures, sounds and figures.

Data item: The smallest named unit of data that has meaning in the real world (examples: last

name, Locality, STD_Code).

Database: An interrelated collection of data that serves the needs of multiple users within one or

more organizations, i.e. interrelated collections of records of potentially many types.

Database administrator (DBA): A person or group of persons responsible for the effective

use of database technology in an organization or enterprise. The DBA is said to be the custodian or

owner of the database.

Database Management System: A DBMS is a collection of software programs that

allows large, structured sets of data to be stored, modified, extracted and manipulated in

different ways. A Database Management System (DBMS) also provides security features that

protect against unauthorized users trying to gain access to confidential information, and that prevent

data loss in case of a system crash. Depending on their specific requirements, users are

allowed access either to the whole database or to specific subschemas, through the use of passwords.

The DBMS is also responsible for the database's integrity, ensuring that no two users are able to

update the same record at the same time, as well as preventing duplicate entries, such as two

employees being given the same employee number.

The following are examples of database applications:

1. Computerized library systems.

2. Automated teller machines.

3. Airline reservation systems.


4. Inventory Management systems.

Innumerable Database Management System (DBMS) software products are available in

the market. Some of the most popular ones include Oracle, IBM's DB2, Microsoft Access,

Microsoft SQL Server and MySQL. MySQL, one of the most popular database management

systems used by online entrepreneurs, is a relational DBMS. Microsoft

Access (another popular DBMS) is not a fully object-oriented system, even

though it does exhibit certain object-oriented features.

Example: A database may contain detailed student information; certain users may only be

allowed access to student names, addresses and phone numbers, while other users may be able

to view students' payment details or marks. Access and change logs can be

programmed to add even more security to a database, recording the date, time and details of any

user making any alteration to the database.

Furthermore, database management systems employ a query language and report

writers to interrogate the database and analyze its data. Queries allow users to search, sort, and

analyze specific data by granting users efficient access to the required information.

Example: One would use a query command to make the system retrieve data regarding all

courses of a particular department. The most common query language used to access database

systems is the Structured Query Language (SQL).
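As a brief illustration, such a query might look like this in SQL (the COURSE table and its columns are hypothetical, introduced only for this example):

    -- Retrieve all courses offered by a particular department.
    SELECT course_id, course_name
    FROM course
    WHERE dept_name = 'Computer Science';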

2. Objectives of Database Management:

Data availability—makes an integrated collection of data available to a wide variety of users

* At reasonable cost—performance in query and update; eliminate or control data redundancy

* In meaningful format—data definition language, data dictionary

* Easy access—query language (4GL, SQL, forms, windows, menus);

Data integrity—ensure correctness and validity

* Primary Key Constraint / Foreign Key Constraints / Check Constraints.

* Concurrency control and multi-user updates


* Audit trail.

Privacy (the goal) and security (the means)

* Schema/ Sub-schema,

* Passwords

Management control—DBA: lifecycle control, training, maintenance

Data independence (a relative term) -- Avoids reprogramming of applications, allows easier

conversion and reorganization of data.

Physical data independence: Application program is unaffected by changes in the storage

structure or physical method of data accessing.

Logical data independence: The application program is unaffected by changes in the logical schema.

3. Database Models: Database information normally consists of subjects, such as customers,

employees or suppliers; as well as activities such as orders, payments or purchases. This

information must be organized into related record types through a process known as database

design. The DBMS that is chosen must be able to manage different relationships, which is where

database models come in.

3.1 Hierarchical databases organize data under the premise of a basic parent/child relationship.

Each parent can have many children, but each child can only have one parent. In hierarchical

databases, attributes of specific records are listed under an entity type and entity types are

connected to each other through one-to-many relationships, also known as 1:N mapping.

Originally, hierarchical relationships were most commonly used in mainframe systems, but with

the advent of increasingly complex relationship systems, they have now become too restrictive

and are thus rarely used in modern databases. If any of the one-to-many relationships is

compromised, e.g. an employee having more than one manager, the database structure

switches from hierarchical to a network.

3.2 Network model: In the network model of a database it is possible for a record to have

multiple parents, making the system more flexible compared to the strict single-parent model of

the hierarchical database. The model is made to accommodate many-to-many relationships,

which allows for a more realistic representation of the relationships between entities. Even

though the network database model enjoyed popularity for a short while, it never really lifted off

the ground in terms of staging a revolution. It is now rarely used because of the availability of

more competitive models that boast the higher flexibility demanded in today’s ever advancing

age.

3.3 Relational databases (RDBMS) differ fundamentally from the

aforementioned models, as the records are organized around a set of tables (with

unique identifiers) to represent both the data and their relationships. The fields to be used for

matching are often indexed in order to speed up the process and the data can be retrieved and

manipulated in a number of ways without the need to reorganize the original database tables.

Working under the assumption that file systems (which often use the hierarchical or network

models) are not considered databases, the relational database model is the most commonly used

system today. While the concepts behind hierarchical and network database models are older

than that of the relational model, the latter was in fact the first one to be formally defined.

After the relational DBMS soared to popularity, the most recent development in DBMS

technology came in the form of the object-oriented database model, which offers more flexibility

than the hierarchical, network and relational models put together. Under this model, data exists

in the form of objects, which include both the data and the data’s behavior. Certain modern

information systems contain such convoluted combinations of information that traditional data

models (including the RDBMS) remain too restrictive to adequately model this complex data.

The object-oriented model also exhibits better cohesion and coupling than prior models, resulting

in a database which is not only more flexible and more manageable but also the most able when

it comes to modeling real-life processes. However, due to the immaturity of this model, certain

problems are bound to arise, some major ones being the lack of an SQL equivalent as well as

lack of standardization. Furthermore, the most common use of the object oriented model is to

have an object point to the child or parent OID (object I.D.) to be retrieved; leaving many

programmers with the impression that the object oriented model is simply a reincarnation of the

network model at best. That is, however, an oversimplification of an innovative

technology.


4. Components of a DBMS

The components of a Database Management System (DBMS) are well illustrated by the diagram

shown below.

4.1. Database Engine: Database Engine is the foundation for storing, processing, and securing

data. The Database Engine provides controlled access and rapid transaction processing to meet the

requirements of the most demanding data consuming applications within your enterprise. Use the

Database Engine to create relational databases for online transaction processing or online analytical

processing. This includes creating tables for storing data, and database objects such as

indexes, views, and stored procedures for viewing, managing, and securing data. You can use SQL

Server Management Studio to manage the database objects, and SQL Server Profiler for capturing

server events.

4.2. Data dictionary: A data dictionary is a reserved space within a database which is used to store

information about the database itself. A data dictionary is a set of tables and views which can only

be read, never altered. Most data dictionaries contain different information about the data used

in the enterprise. In terms of the database representation of the data, the data dictionary defines all

schema objects including views, tables, clusters, indexes, sequences, synonyms, procedures,


packages, functions, triggers and many more. This will ensure that all these things follow one

standard defined in the dictionary. The data dictionary also defines how much space has been

allocated for and/or is currently in use by all the schema objects. A data dictionary is used when

finding information about users, objects, schema and storage structures. Every time a data

definition language (DDL) statement is issued, the data dictionary becomes modified.

A data dictionary may contain information such as:

Database design information

Stored SQL procedures

User permissions

User statistics

Database process information

Database growth statistics

Database performance statistics
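For example, in Oracle (the views named here are Oracle-specific; other DBMSs expose similar catalogs under different names), the data dictionary can be queried with ordinary SQL:

    -- List the tables owned by the current user.
    SELECT table_name FROM user_tables;

    -- Show the columns and data types of one of those tables.
    SELECT column_name, data_type
    FROM user_tab_columns
    WHERE table_name = 'S_DEPT';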

4.3. Query Processor: A relational database consists of many parts, but at its heart are two major

components: the storage engine and the query processor. The storage engine writes data to and

reads data from the disk. It manages records, controls concurrency, and maintains log files. The

query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the

chosen plan. The user or program interacts with the query processor, and the query processor in

turn interacts with the storage engine. The query processor isolates the user from the details of

execution: The user specifies the result, and the query processor determines how this result is

obtained. The query processor components include:

DDL interpreter

DML compiler

Query evaluation engine
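Many systems let you inspect the plan the query processor chooses. A rough sketch in Oracle-style syntax (the S_DEPT query is illustrative only):

    -- Ask the query processor to record its chosen execution plan
    -- without actually running the statement.
    EXPLAIN PLAN FOR
    SELECT * FROM s_dept WHERE dept_id = 'F-1001';

    -- Display the recorded plan.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);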

4.4. Report writer: Also called a report generator, a report writer is a program, usually part of a database

management system, that extracts information from one or more files and presents the information

in a specified format. Most report writers allow you to select records that meet certain conditions

and to display selected fields in rows and columns. You can also format data into pie charts, bar


charts, and other diagrams. Once you have created a format for a report, you can save the format

specifications in a file and continue reusing it for new data.

5. Database Languages

5.1 Data Definition Language (DDL): The Data Definition Language is used to define the

structure of a database. The database structure definition (schema) typically includes the

following:

defining all data elements; defining data element fields and records; defining the name, field

length, and field type for each data item; and defining controls for fields that can have only selective

values.

Typical DDL operations (with their respective keywords in the structured query language SQL):

Creation of tables and definition of attributes (CREATE TABLE ...)

Change of tables by adding or deleting attributes (ALTER TABLE …)

Deletion of whole table including content (DROP TABLE …)
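A minimal sketch of these three operations, using Oracle-style data types (the STUDENT table and its columns are illustrative, not part of the syllabus):

    -- Create a table, defining its attributes and a control on selective values.
    CREATE TABLE student (
        roll_no NUMBER(5)    PRIMARY KEY,
        name    VARCHAR2(50) NOT NULL,
        course  VARCHAR2(20) CHECK (course IN ('Computer', 'Accounts', 'Economics', 'Arts'))
    );

    -- Change the table by adding an attribute.
    ALTER TABLE student ADD (phone VARCHAR2(15));

    -- Delete the whole table, including its content.
    DROP TABLE student;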

5.2 Data Manipulation Language (DML): Once the

structure is defined, the database is ready for entry and manipulation of data. Data Manipulation

Language (DML) includes the commands to enter and manipulate the data; with these commands

the user can add new records, navigate through the existing records, view contents of various

fields, modify the data, delete existing records, and sort the records in the desired sequence. Typical

DML operations (with their respective keywords in the structured query language SQL):

Add data (INSERT)

Change data (UPDATE)

Delete data (DELETE)

Query data (SELECT)
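A minimal sketch of these four operations, reusing the hypothetical STUDENT table from the DDL example above:

    -- Add a new record.
    INSERT INTO student (roll_no, name, course)
    VALUES (10, 'Ravi Kumar', 'Computer');

    -- Modify an existing record.
    UPDATE student SET course = 'Accounts' WHERE roll_no = 10;

    -- Query the records, sorted in the desired sequence.
    SELECT roll_no, name FROM student ORDER BY name;

    -- Delete an existing record.
    DELETE FROM student WHERE roll_no = 10;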

5.3 Data Control Language (DCL): Data control commands in SQL control access privileges

and security issues of a database system or parts of it. These commands are closely related to the


DBMS (Database Management System) and can therefore vary in different SQL

implementations. Some typical commands are:

GRANT - give user access privileges to a database

REVOKE - withdraw access privileges previously given with the GRANT

command

Since these commands depend on the actual database management system (DBMS), we will not

cover DCL in this module.
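For orientation only, a brief sketch of the two commands (exact syntax and privilege names vary between DBMSs; the user REENA and the STUDENT table are hypothetical):

    -- Give a user the right to read and insert into a table.
    GRANT SELECT, INSERT ON student TO reena;

    -- Withdraw the insert privilege again.
    REVOKE INSERT ON student FROM reena;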

6. Database Users

6.1 Database Administrator (DBA): The DBA is a person or a group of persons who is

responsible for the management of the database. The DBA is responsible for authorizing access

to the database by granting and revoking permissions to the users, for coordinating and monitoring its

use, managing backups and repairing damage due to hardware and/or software failures and for

acquiring hardware and software resources as needed. In a small organization the role of

the DBA is performed by a single person, while in large organizations a group of

DBAs share the responsibilities.

6.2 Database Designers: They are responsible for identifying the data to be stored in the

database and for choosing appropriate structure to represent and store the data. It is the

responsibility of database designers to communicate with all prospective users of the database in

order to understand their requirements, so that they can create a design that meets those

requirements.

6.3 End Users: End Users are the people who interact with the database through applications or

utilities. The various categories of end users are:

• Casual End Users - These Users occasionally access the database but may need

different information each time. They use a sophisticated database query language to specify their

requests. For example: High level Managers who access the data weekly or biweekly.


• Naive End Users - These users frequently query and update the database using standard

types of queries. The operations that can be performed by this class of users are very limited and

affect a precise portion of the database.

For example: Reservation clerks for airlines/hotels check availability for a given request and

make reservations. Persons using Automated Teller Machines (ATMs) also fall under this

category, as they have access to a limited portion of the database.

• Standalone End Users/On-line End Users - These end users interact with the

database directly via an on-line terminal, or indirectly through menu- or graphics-based interfaces.

For example: the user of a text package, or of library management software that stores a variety of library

data such as issues and returns of books for fine purposes.

6.4 Application Programmers

Application Programmers are responsible for writing application programs that use the database.

These programs could be written in general-purpose programming languages such as Visual

Basic, Developer, C, FORTRAN or COBOL to manipulate the database. These application

programs operate on the data to perform various operations such as retrieving information and

creating new records.

7. ADVANTAGES OF DBMS

The DBMS (Database Management System) is preferred over the conventional file

processing system due to the following advantages:

Controlling Data Redundancy - In the conventional file processing system, every user group

maintains its own files for handling its data. This may lead to:

Duplication of same data in different files.

Wastage of storage space.

Errors may be generated due to updating of the same data in different files.

Time is wasted entering the same data again and again.

Computer Resources are needlessly used.


It is very difficult to combine information.

All of the above-mentioned problems are eliminated in a Database Management System.

Elimination of Inconsistency - In the file processing system information is duplicated

throughout the system, so changes made in one file may need to be carried over to another

file. This may lead to inconsistent data, so we need to remove this duplication of data in multiple

files to eliminate inconsistency.

For example: Let us consider an example of a student result system. Suppose that in the

STUDENT file it is indicated that Roll No. = 10 has opted for the 'Computer' course, but in the RESULT

file it is indicated that Roll No. = 10 has opted for the 'Accounts' course. In this case the two

entries for a particular student don't agree with each other, and the database is said to be in an

inconsistent state. Hence, to eliminate this conflicting information we need to centralize the

database. On centralizing the database the duplication will be controlled and hence

inconsistency will be removed. Data inconsistencies are often encountered in everyday life.

Consider another example: we have all come across situations when a new address is

communicated to an organization that we deal with (e.g. Telecom, Gas Company, Bank). We find

that some of the communications from that organization are received at the new address while

others continue to be mailed to the old address. Combining all the data in a database would

reduce redundancy as well as inconsistency, and is therefore likely to reduce the costs for

collection, storage and updating of data.

Better service to the users - A DBMS is often used to provide better services to the users. In

a conventional system, availability of information is often poor, since it is normally difficult to

obtain information that the existing systems were not designed for. Once several conventional

systems are combined to form one centralized database, the availability of information and its

timeliness are likely to improve, since the data can now be shared and a DBMS makes it easy to

respond to anticipated information requests.

Centralizing the data in the database also means that user can obtain new and combined

information easily that would have been impossible to obtain otherwise. Also use of DBMS

should allow users who don't know programming to interact with the data more easily, unlike a file

processing system where the programmer may need to write new programs to meet every new

demand.


Flexibility of the System is improved - Since changes are often necessary to the contents of the

data stored in any system, these changes are made more easily in a centralized database than in a

conventional system. Application programs need not be changed when the data in the

database changes. This also helps maintain the consistency and integrity of the data in the database.

Integrity can be improved - Since the data of an organization using the database approach is

centralized and is used by a number of users at a time, it is essential to enforce integrity

constraints.

In conventional systems, because the data is duplicated in multiple files, updates or

changes may sometimes lead to the entry of incorrect data in some of the files where it exists.

For example: Consider the result system that we have already discussed. Since multiple files

are to be maintained, you may sometimes enter a value for course which does not exist. Suppose

course can have the values (Computer, Accounts, Economics and Arts) but we enter the value 'Hindi'

for it; this leads to inconsistent data and a lack of integrity. Even if we centralize the

database it may still contain incorrect data. For example:

• The salary of a full-time employee may be entered as Rs. 500 rather than Rs. 5000.

• A student may be shown to have borrowed books but have no enrollment.

• A list of employee numbers for a given department may include a number of non-existent

employees. These problems can be avoided by defining validation procedures to be applied whenever any

update operation is attempted.
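As a hedged sketch, such validation can be declared with constraints so that the DBMS itself rejects bad values (the table and column names are illustrative):

    -- Guard against an impossible salary such as Rs. 500 for a full-time employee.
    CREATE TABLE employee (
        emp_no NUMBER(6)   PRIMARY KEY,
        salary NUMBER(8,2) CHECK (salary >= 1000)
    );

    -- Restrict course to the valid set, so a value like 'Hindi' is rejected.
    ALTER TABLE result
        ADD CONSTRAINT chk_course
        CHECK (course IN ('Computer', 'Accounts', 'Economics', 'Arts'));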

Standards can be enforced - Since all access to the database must be through the DBMS,

standards are easier to enforce. Standards may relate to the naming of data, format of data,

structure of the data etc. Standardizing stored data formats is usually desirable for the purpose of

data interchange or migration between systems.

Security can be improved - In conventional systems, applications are developed in an ad hoc or

temporary manner. Often different systems of an organization access different components

of the operational data; in such an environment, enforcing security can be quite difficult. Setting

up a database makes it easier to enforce security restrictions, since the data is now centralized. It is

easier to control who has access to what parts of the database. Different checks can be


established for each type of access (retrieve, modify, delete etc.) to each piece of information in

the database.

Consider an example of banking, in which employees at different levels may be given access

to different types of data in the database. A clerk may be given the authority to know only the

names of all the customers who have a loan in the bank, but not the details of each loan the customer

may have. This can be accomplished by giving appropriate privileges to each employee.
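A rough sketch of how this banking example could be enforced (the LOAN table, the LOAN_CUSTOMERS view and the user CLERK are all hypothetical):

    -- Expose only the names of loan customers, hiding the loan details.
    CREATE VIEW loan_customers AS
        SELECT customer_name FROM loan;

    -- The clerk can query the view but not the underlying LOAN table.
    GRANT SELECT ON loan_customers TO clerk;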

Organization's requirements can be identified - Organizations have sections and departments,

and each of these units often considers its own work as the most important and therefore

considers its needs as the most important. Once a database has been set up with centralized

control, it becomes necessary to identify the organization's requirements and to balance the needs of the

different units. It may become necessary to ignore some requests for information if they

conflict with higher-priority needs of the organization. It is the responsibility of the DBA

(Database Administrator) to structure the database system to provide the overall service that is

best for an organization.

For example: A DBA must choose the best file structure and access method to give faster response

to highly critical applications than to less critical applications.

Overall cost of developing and maintaining systems is lower - It is much easier to respond to

unanticipated requests when data is centralized in a database than when it is stored in a

conventional file system. Although the initial cost of setting up a database can be large, one

normally expects the overall cost of setting up the database and of developing and maintaining

application programs to be far lower than for similar services using conventional systems, since

the productivity of programmers can be higher when using the non-procedural languages that have been

developed with DBMSs than when using procedural languages.

Data model must be developed - Perhaps the most important advantage of setting up a

database system is the requirement that an overall data model for the organization be built. In

conventional systems, it is more likely that files will be designed as per the needs of particular

applications. The overall view is often not considered. Building an overall view of an

organization's data is usually cost-effective in the long term.


Provides backup and recovery - Centralizing a database provides schemes such as

recovery and backup from failures, including disk crashes, power failures and software errors,

which help the database recover from an inconsistent state to the state that existed prior

to the occurrence of the failure, though the methods involved can be very complex.

8. Three-Schema Architecture

The objective of the three-schema architecture is to separate the user application programs and the

physical database. The three-schema architecture is an effective tool with which the user can

visualize the schema levels in a database system. The three-level ANSI architecture has an

important place in database technology development because it clearly separates the users'

external level, the system's conceptual level, and the internal storage level for designing a

database. In the three-schema architecture, schemas can be defined at three different levels.

8.1 External Schema:

An external schema describes a specific user's view of the data, and the specific methods

and constraints connected with this information. Each external schema describes the part

of the database that a particular user group is interested in, and hides the rest of

the database from that user group.
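External schemas are commonly realized as views. A minimal sketch, assuming a hypothetical STUDENT base table:

    -- An external schema for the accounts office: it sees payment details,
    -- while marks and other columns of STUDENT remain hidden.
    CREATE VIEW student_payments AS
        SELECT roll_no, name, fee_paid
        FROM student;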

8.2 Internal Schema:

The internal schema mainly describes the physical storage structure of the database.

The internal schema describes the data from a view very close to the computer or system in

general. It completes the logical schema with technical aspects of the data, such as storage methods

or helper functions, for greater efficiency.

8.3 Conceptual Schema: It describes the structure of the whole database for the entire user

community. The conceptual schema hides the details of physical storage structure and

concentrates on describing entities, data types, relationships and constraints. The

implementation of the conceptual schema is based on a conceptual schema design in a high-level data

model.


9. Data Independence:

With knowledge of the three-schema architecture, the term data independence can be

explained as follows: each higher level of the data architecture is immune to changes at the

next lower level of the architecture.

Data independence is normally thought of in terms of two levels or types. Logical data

independence makes it possible to change the structure of the data without

modifying the application programs that make use of the data. There is no need to rewrite current

applications as part of the process of adding data to or removing data from the system.

The second type or level of data independence is known as physical data independence. This

approach has to do with altering the organization or storage procedures related to the data, rather

than modifying the data itself. Accomplishing this shift in file organization or the indexing

strategy used for the data does not require any modification to the external structure of the

applications, meaning that users of the applications are not likely to notice any difference at all in

the function of their programs.

Database Instance: The term instance is typically used to describe a complete database

environment, including the RDBMS software, table structure, stored procedures and other


functionality. It is most commonly used when administrators describe multiple

instances of the same database. An instance is also known as an environment.

Examples: An organization with an employee database might have three different

instances: production (used to contain live data), pre-production (used to test new

functionality prior to release into production) and development (used by database developers to

create new functionality).

Relational Schema: A relation schema can be thought of as the basic information describing a

table or relation. This includes a set of column names, the data types associated with each

column, and the name associated with the entire table.

10. Entity - Relationship Model

The Entity - Relationship Model (E-R Model) is a high-level conceptual data model developed

by Chen in 1976 to facilitate database design. Conceptual Modeling is an important phase in

designing a successful database. A conceptual data model is a set of concepts that describe the

structure of a database and associated retrieval and update transactions on the database. A high

level model is chosen so that all the technical aspects are also covered. The E-R data model grew

out of the exercise of using commercially available DBMS to model the database. The E-R

model is the generalization of the earlier available commercial models like the Hierarchical and

the Network Model. It also allows the representation of the various constraints as well as their

relationships.

So to sum up, the Entity-Relationship (E-R) Model is based on a view of the real world that

consists of a set of objects called entities and relationships among entity sets, an entity set being a

group of similar objects. A relationship between entity sets is represented by a named E-R

relationship and is of 1:1, 1:N or M:N type, which tells the mapping from one entity set to

another.

The E-R model is shown diagrammatically using Entity-Relationship (E-R) diagrams

which represent the elements of the conceptual model that show the meanings and the


relationships between those elements independent of any particular DBMS and implementation

details.

10.1 What are Entity Relationship Diagrams?

Entity Relationship Diagrams (ERDs) illustrate the logical structure of databases.

Figure: An ER Diagram

10.2 Entity Relationship Diagram Notations

Entity

An entity is a real-world object (living or non-living) or concept about which you want to

store information.

Weak Entity

A weak entity is an entity that must be defined through a foreign key relationship with another entity, as it

cannot be uniquely identified by its own attributes alone.


Key attribute

A key attribute is the unique, distinguishing characteristic of the entity, which can uniquely

identify the instances of an entity set. For example, an employee's social security number might be

the employee's key attribute.

Multi-valued attribute

A multi-valued attribute can have more than one value. For example, an employee entity can

have multiple skill values.

Derived attribute

A derived attribute is based on another attribute. For example, an employee's monthly salary is

based on the employee's annual salary.

Relationships

Relationships illustrate how two entities share information in the database structure.

First, connect the two entities, then drop the relationship notation on the line.


Cardinality

Cardinality specifies how many instances of an entity relate to one instance of another entity.

Ordinality is also closely linked to cardinality. While cardinality specifies the occurrences of a

relationship, ordinality describes the relationship as either mandatory or optional. In other words,

cardinality specifies the maximum number of relationships and ordinality specifies the absolute

minimum number of relationships.

Recursive relationship

In some cases, entities can be self-linked. For example, employees can supervise other

employees.


10.3 How to Design Effective ER Diagrams

1) Make sure that each entity only appears once per diagram.

2) Name every entity, relationship, and attribute on your diagram.

3) Examine relationships between entities closely. Are they necessary? Are there any

relationships missing? Eliminate any redundant relationships. Don't connect relationships to each

other.

4) Use colors to highlight important portions of your diagram.

5) Create a polished diagram by adding shadows and color. You can choose from a number of

ready-made styles in your diagramming tool, or you can create your own.

10.4 Features of the E-R Model:

1. The E-R diagram used for representing E-R Model can be easily converted into Relations

(tables) in Relational Model.

2. The E-R model is used by the database developer for good database design, so that the data

model can be implemented in various DBMSs.

3. It is helpful as a problem decomposition tool as it shows the entities and the relationship

between those entities.

4. It is inherently an iterative process. On later modifications, the entities can be inserted into this

model.


5. It is very simple and easy to understand by various types of users and designers because

specific standards are used for their representation.

11. Enhanced Entity Relationship (EER) Diagrams

An EER diagram contains all the essential modeling concepts of an ER diagram

and adds extra concepts:

o Specialization/generalization

o Subclass/super class

o Categories

o Attribute inheritance

Extended ER diagrams use some object-oriented concepts such as inheritance.

EER is used to model concepts more accurately than the ER diagram.

Sub classes and Super classes

In some cases, an entity type has numerous sub-groupings of its entities that are meaningful,

and need to be explicitly represented, because of their importance.

For example, members of entity Employee can be grouped further into Secretary, Engineer,

Manager, Technician, Salaried_Employee.

The set listed is a subset of the entities that belong to the Employee entity, which means that

every entity that belongs to one of the subsets is also an Employee.

Each of these sub-groupings is called a subclass, and the Employee entity is called the

super class.

Figure: The Employee super class with disjoint (d) subclasses Secretary, Technician and Engineer; Employee works for Department.

An entity cannot be a member of only a subclass; it must also be a member of the super

class.

An entity can be included as a member of a number of sub classes, for example, a Secretary

may also be a salaried employee; however, not every member of the super class must be a

member of a sub class.

Type Inheritance

The type of an entity is defined by the attributes it possesses, and the relationship types it

participates in.

Because an entity in a subclass represents the same real-world entity as in the super class, it

possesses values for its specific attributes, as well as values for its attributes as a member of the super

class.

This means that an entity that is a member of a subclass inherits all the attributes of the entity

as a member of the super class; as well, an entity inherits all the relationships in which the

super class participates.

Specialization

The process of defining a set of subclasses of a super class.

Specialization is the top-down refinement into (super) classes and subclasses

The set of sub classes is based on some distinguishing characteristic of the super class.


For example, the set of subclasses of Employee (Secretary, Engineer, Technician)

differentiates among employees based on job type.

There may be several specializations of an entity type based on different distinguishing

characteristics.

Another example is the specialization, Salaried_Employee and Hourly_Employee, which

distinguish employees based on their method of pay.

Notation for Specialization

To represent a specialization, the subclasses that define a specialization are attached by lines

to a circle that represents the specialization, which is in turn connected to the super class.

The subset symbol (half-circle) shown on each line connecting a subclass to the super class

indicates the direction of the super class/subclass relationship.

Attributes that only apply to the sub class are attached to the rectangle representing the

subclass. They are called specific attributes.

A sub class can also participate in specific relationship types. See Example.

Reasons for Specialization

Certain attributes may apply to some but not all entities of a super class. A subclass is

defined in order to group the entities to which the attributes apply.

The second reason for using subclasses is that some relationship types may be participated in

only by entities that are members of the subclass.

Figure: The Employee super class with disjoint (d) subclasses Secretary, Technician and Engineer; Employee works for Department, and Engineer additionally belongs to a Professional Organization.


Summary of Specialization

Allows for:

Defining a set of subclasses of an entity type

Creating additional specific attributes for each subclass

Creating additional specific relationship types between each subclass and other entity types or

other subclasses.

Generalization

The reverse of specialization is generalization.

Several classes with common features are generalized into a super class.

For example, the entity types Car and Truck share common attributes License_PlateNo,

VehicleID and Price; therefore they can be generalized into the super class Vehicle.
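One common way to realize such a generalization in a relational schema is one table per class, sharing the key. A hedged sketch using the attributes named above (the subclass-specific columns are assumptions):

    -- Super class: attributes common to all vehicles.
    CREATE TABLE vehicle (
        vehicle_id      NUMBER(8) PRIMARY KEY,
        license_plateno VARCHAR2(10),
        price           NUMBER(10,2)
    );

    -- Subclasses share the key and add their specific attributes.
    CREATE TABLE car (
        vehicle_id  NUMBER(8) PRIMARY KEY REFERENCES vehicle,
        no_of_doors NUMBER(1)       -- hypothetical specific attribute
    );

    CREATE TABLE truck (
        vehicle_id NUMBER(8) PRIMARY KEY REFERENCES vehicle,
        tonnage    NUMBER(5,2)      -- hypothetical specific attribute
    );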

Constraints on Specialization and Generalization

Several specializations can be defined on an entity type.

Entities may belong to subclasses in each of the specializations.

The specialization may also consist of a single subclass, such as the manager specialization;

in this case we don’t use the circle notation.

Types of Specializations

Predicate-defined or Condition-defined specialization

Occurs in cases where we can determine exactly the entities of each subclass by placing a

condition on the value of an attribute of the super class.

An example is where the Employee entity has an attribute, JobType. We can specify the

condition of membership in the Secretary subclass by the condition JobType = 'Secretary'.

Example:


The condition is called the defining predicate of the sub class.

The condition is a constraint specifying that exactly those entities of the Employee entity type

whose attribute value for JobType is Secretary belong to the subclass.

Predicate defined subclasses are displayed by writing the predicate condition next to the line

that connects the subclass to the specialization circle.

Attribute-defined specialization

If all subclasses in a specialization have their membership condition on the same attribute of

the super class, the specialization is called an attribute-defined specialization, and the

attribute is called the defining attribute.

Attribute-defined specializations are displayed by placing the defining attribute name next to

the arc from the circle to the super class.

User-defined specialization

When we do not have a condition for determining membership in a subclass, the subclass is

called user-defined.

Membership to a subclass is determined by the database users when they add an entity to the

subclass.

Disjointness / Overlap Constraint

Specifies that the subclasses of the specialization must be disjoint, which means that an entity

can be a member of, at most, one subclass of the specialization.

The d in the specialization circle stands for disjoint.

If the subclasses are not constrained to be disjoint, they overlap.

Overlap means that an entity can be a member of more than one subclass of the

specialization.

Overlap constraint is shown by placing an o in the specialization circle.

Completeness Constraint

The completeness constraint may be either total or partial.


A total specialization constraint specifies that every entity in the super class must be a

member of at least one subclass of the specialization.

Total specialization is shown by using a double line to connect the super class to the circle.

A single line is used to display a partial specialization, meaning that an entity does not have

to belong to any of the subclasses.

Disjointness vs. Completeness

Disjointness constraints and completeness constraints are independent. The following

combinations of constraints on specializations are possible:

Disjoint, total

Disjoint, partial

Figure: Disjoint specializations (d): Department specialized into Academic and Administrative, and Employee specialized into Secretary, Analyst and Engineer.

Overlapping, total

Overlapping, partial

Figure: Overlapping specializations (o): Part specialized into Manufactured and Purchased, and Movie specialized into Children, Comedy and Drama.


Chapter-1

INTRODUCTION TO DBMS AND DATA MODELING.

End Chapter quizzes:

Q.1. Entity is represented by the symbol

(a) Circle

(b) Ellipse

(c) Rectangle

(d) Square

Q2. A relationship is

(a) an item in an application

(b) a meaningful dependency between entities

(c) a collection of related entities

(d) related data

Q3. Overall logical structure of a database can be expressed graphically by

a). ER diagram

b). Records

c). Relations

d). Hierarchy.

Q4. In the three-schema architecture, a specific view of data given to a particular user is defined at

a) Internal Level

b) External Level

c) Conceptual Level

d) Physical Level

Q5. By data redundancy in a file based system we mean that

(a) Unnecessary data is stored

(b) Same data is duplicated in many files

(c) Data is unavailable

(d) Files have redundant data

Q6. Entities are identified from the word statement of a problem by

(a) picking words which are adjectives

(b) picking words which are nouns

(c) picking words which are verbs

(d) picking words which are pronouns

Q7. Data independence allows

(a) sharing the same database by several applications

(b) extensive modification of applications

(c) no data sharing between applications

(d) elimination of several application programs


Q8. Access right to a database is controlled by

(a) top management

(b) system designer

(c) system analyst

(d) database administrator

Q9. Data integrity in a file-based system may be lost because

(a) the same variable may have different values in different files

(b) files are duplicated

(c) unnecessary data is stored in files

(d) redundant data is stored in files

Q10. The characteristics of an entity set are known as:

a) Attributes

b) Cardinality

c) Relationship

d) Many to Many Relation

Q11. Vehicle identification number, color, weight, and horsepower best exemplify:

a.) entities.

b.) entity types.

c.) data markers.

d.) attributes.

Q12. If each employee can have more than one skill, then skill is referred to as a:

a.) gerund.

b.) multivalued attribute.

c.) nonexclusive attribute.

d.) repeating attribute

Q13. The data structure used in the hierarchical model is

a) Tree

b) Graph

c) Table

d) None of these.

Q14. By data security in a DBMS we mean

(a) preventing access to data

(b) allowing access to data only to authorized users

(c) preventing changing data

(d) introducing integrity constraints


Chapter-2

RELATIONAL DATABASE MODEL

1. Introductory Concepts

Relational Database Management System

A Relational Database Management System (RDBMS) provides a complete and integrated

approach to information management. A relational model provides the basis for a relational

database. A relational model has three aspects:

Structures

Operations

Integrity rules

Structures consist of a collection of objects or relations that store data. An example of a relation is

a table. You can store information in a table and use the table to retrieve and modify data.

Operations are used to manipulate data and structures in a database. When using operations,

you must stick to a predefined set of integrity rules.

Integrity rules are laws that govern the operations allowed on data in a database. This ensures

data accuracy and consistency.

Relational database components include:

Table

Row

Column

Field

Primary key

Foreign key


Figure Relational database components

A Table is a basic storage structure of an RDBMS and consists of columns and rows. A table

represents an entity. For example, the S_DEPT table stores information about the departments of

an organization.

A Row is a combination of column values in a table and is identified by a primary key. Rows are

also known as records. For example, a row in the table S_DEPT contains information about one

department.

A Column is a collection of one type of data in a table. Columns represent the attributes of an

object. Each column has a column name and contains values that are bound by the same type and

size. For example, a column in the table S_DEPT specifies the names of the departments in the

organization.

A Field is an intersection of a row and a column. A field contains one data value. If there is no

data in the field, the field is said to contain a NULL value.


Figure Table, Row, Column & Field

A Primary key is a column or a combination of columns that is used to uniquely identify each

row in a table. For example, the column containing department numbers in the S_DEPT table is

created as a primary key and therefore every department number is different. A primary key must

contain a value. It cannot contain a NULL value.

A Foreign key is a column or set of columns that refers to a primary key in the same table or

another table. You use foreign keys to establish connections between, or within, tables.

A foreign key must either match a primary key or else be NULL. Rows are connected logically

when required. The logical connections are based upon conditions that define a relationship

between corresponding values, typically between a primary key and a matching foreign key. This

relational method of linking provides great flexibility as it is independent of physical links

between records.


Figure Primary & Foreign key
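A minimal sketch of how such keys are declared (S_DEPT is named in the text; the column names and the S_EMP table are assumptions for illustration):

    CREATE TABLE s_dept (
        dept_id   VARCHAR2(6)  PRIMARY KEY,  -- primary key: unique and never NULL
        dept_name VARCHAR2(30) NOT NULL
    );

    CREATE TABLE s_emp (
        emp_no  NUMBER(6) PRIMARY KEY,
        name    VARCHAR2(40),
        dept_id VARCHAR2(6) REFERENCES s_dept(dept_id)  -- must match a department or be NULL
    );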

RDBMS Properties

An RDBMS is easily accessible. You execute commands in the Structured Query Language

(SQL) to manipulate data. SQL is the International Organization for Standardization (ISO) standard

language for interacting with an RDBMS.

An RDBMS provides full data independence. The organization of the data is independent of the

applications that use it. You do not need to specify the access routes to tables or know how data

is physically arranged in a database.

A relational database is a collection of individual, named objects. The basic unit of data storage

in a relational database is called a table. A table consists of rows and columns used to store

values. For access purposes, the order of rows and columns is insignificant. You can control the

access order as required.


Figure SQL & Database

When querying the database, you use conditional operations such as joins and restrictions. A join

combines data from separate database rows. A restriction limits the specific rows returned by a

query.

Figure Conditional operations


An RDBMS enables data sharing between users. At the same time, you can ensure consistency

of data across multiple tables by using integrity constraints. An RDBMS uses various types of

data integrity constraints. These types include entity, column, referential and user-defined

constraints.

The entity constraint ensures uniqueness of rows, and the column constraint ensures

consistency of the type of data within a column. The referential constraint ensures the validity of

foreign keys, and user-defined constraints are used to enforce specific business rules.

An RDBMS minimizes the redundancy of data. This means that similar data is not

repeated unnecessarily in multiple tables.

2. Codd's 12 rules

Codd's 12 rules are a set of twelve rules proposed by E. F. Codd, a pioneer of the relational

model for databases, designed to define what is required from a database management system in

order for it to be considered relational, i.e., an RDBMS. Codd produced these rules as part of a

personal campaign to prevent his vision of the relational database being diluted.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way, namely by values

in column positions within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible with no ambiguity. This rule is essentially a restatement of the

fundamental requirement for primary keys. It says that every individual scalar value in the

database must be logically addressable by specifying the name of the containing table, the name

of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a

representation of "missing information and inapplicable information" that is systematic, distinct

from all regular values (for example, "distinct from zero or any other number", in the case of

numeric values), and independent of data type. It is also implied that such representations must

be manipulated by the DBMS in a systematic way.
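A short illustration of systematic null treatment (the STUDENT table and its columns are hypothetical):

    -- The phone number is unknown, so it is left NULL rather than set to 0 or ''.
    INSERT INTO student (roll_no, name, phone) VALUES (11, 'Asha', NULL);

    -- NULLs require the special IS NULL predicate; phone = NULL never matches.
    SELECT roll_no, name FROM student WHERE phone IS NULL;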

Rule 4: Active online catalog based on the relational model:


The system must support an online, inline, relational catalog that is accessible to authorized users

by means of their regular query language. That is, users must be able to access the database's

structure (catalog) using the same query language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

o Has a linear syntax

o Can be used both interactively and within application programs,

o Supports data definition operations (including view definitions), data manipulation

operations (update as well as retrieval), security and integrity constraints,

and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This means that data

can be retrieved from a relational database in sets constructed of data from multiple rows and/or

multiple tables. This rule states that insert, update, and delete operations should be supported for

any retrievable set rather than just for a single row in a single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must

not require a change to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a change to an

application based on the structure. Logical data independence is more difficult to achieve than

physical data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in

the catalog. It must be possible to change such constraints as and when appropriate without

unnecessarily affecting existing applications.

Rule 11: Distribution independence:


The distribution of portions of the database to various locations should be invisible to users of

the database. Existing applications should continue to operate successfully:

o when a distributed version of the DBMS is first introduced; and

o when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:

If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used

to subvert the system, for example, bypassing a relational security or integrity constraint.

4. Data Integrity and Integrity Rules

Data Integrity is a very important concept in database operations in particular, and in Data Warehousing and Business Intelligence in general, because data integrity ensures that only correct, consistent, high-quality data is accessible to its users. The database designer is responsible for incorporating elements to promote the accuracy and reliability of stored data within the database. There are many different techniques that can be used to encourage data integrity, some of them dependent on what database technology is being used. Here we discuss the two most common integrity rules.

Integrity rule 1: Entity integrity

It says that no component of a primary key may be null.

All entities must be distinguishable. That is, they must have a unique identification of some kind.

Primary keys perform the unique identification function in a relational database. An identifier that was wholly null would be a contradiction in terms: it would mean there was some entity that

did not have any unique identification. That is, it was not distinguishable from other entities. If

two entities are not distinguishable from each other, then by definition there are not two entities

but only one.

Integrity rule 2: Referential integrity

The referential integrity constraint is specified between two relations and is used to

maintain the consistency among tuples of the two relations.

Suppose we wish to ensure that a value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another relation. This is referential integrity.

The referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation. This means that referential integrity is


a constraint specified on more than one relation. This ensures that the consistency is maintained

across the relations.

Table A

DeptID DeptName DeptManager

F-1001 Financial Nathan

S-2012 Software Martin

H-0001 HR Jason

Table B

EmpNo DeptID EmpName

1001 F-1001 Tommy

1002 S-2012 Will

1003 H-0001 Jonathan
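In SQL terms, this referential integrity constraint could be declared with a foreign key when the tables are created. A minimal sketch (the table names TableA and TableB and the data types are assumptions made for illustration):

CREATE TABLE TableA
(
DeptID varchar(10) PRIMARY KEY,   -- each department appears once
DeptName varchar(30),
DeptManager varchar(30)
)

CREATE TABLE TableB
(
EmpNo int PRIMARY KEY,
DeptID varchar(10) REFERENCES TableA (DeptID),   -- must match an existing department
EmpName varchar(30)
)

With this declaration, the DBMS rejects any row of TableB whose DeptID does not already exist in TableA.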

5. Relational algebra

Relational algebra is a procedural query language, which consists of a set of operations that take

one or two relations as input and produce a new relation as their result. The fundamental

operations that will be discussed in this section are: select, project, union, and set difference.

Besides the fundamental operations, the following additional operations will be discussed: set-

intersection.

Each operation will be applied to tables of a sample database. Each table is otherwise known as a

relation and each row within the table is referred to as a tuple. The sample database consists of tables one might see in a bank; it contains the following 6

relations:

Account

branch-name account-number balance
Downtown A-101 500
Mianus A-215 700
Perryridge A-102 400
Round Hill A-305 350
Brighton A-201 900
Redwood A-222 700
Brighton A-217 750


Branch

branch-name branch-city assets
Downtown Brooklyn 9000000
Redwood Palo Alto 2100000
Perryridge Horseneck 1700000
Mianus Horseneck 400000
Round Hill Horseneck 8000000
Pownal Bennington 300000
North Town Rye 3700000
Brighton Brooklyn 7100000

Customer

customer-name customer-street customer-city
Jones Main Harrison
Smith North Rye
Hayes Main Harrison
Curry North Rye
Lindsay Park Pittsfield
Turner Putnam Stamford
Williams Nassau Princeton
Adams Spring Pittsfield
Johnson Alma Palo Alto
Glenn Sand Hill Woodside
Brooks Senator Brooklyn
Green Walnut Stamford

Depositor

customer-name account-number
Johnson A-101
Smith A-215
Hayes A-102
Turner A-305
Johnson A-201
Jones A-217
Lindsay A-222

Loan

branch-name loan-number amount
Downtown L-17 1000
Redwood L-23 2000
Perryridge L-15 1500
Downtown L-14 1500
Mianus L-93 500
Round Hill L-11 900
Perryridge L-16 1300


Borrower

customer-name loan-number
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Adams L-16

The Select operation is a unary operation, which means it operates on one relation. Its function is

to select tuples that satisfy a given predicate. To denote selection, the lowercase Greek letter

sigma (σ) is used. The predicate appears as a subscript to σ. The argument relation is given in parentheses following the σ.

For example, to select those tuples of the loan relation where the branch is "Perryridge," we

write:

branch-home = "Perryridge" (loan)

The results of the query are the following:

branch-name loan-number amount

Perryridge L-15 1500
Perryridge L-16 1300

Comparisons such as =, ≠, <, ≤, >, ≥ can also be used in the selection predicate. An example query using a comparison, to find all tuples in which the amount lent is more than $1200, would be written:

σ amount > 1200 (loan)
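In SQL, the same selection could be written as follows (assuming a table loan with columns branch_name, loan_number and amount, mirroring the relation above):

SELECT *
FROM loan
WHERE amount > 1200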

The project operation is a unary operation that returns its argument relation with certain

attributes left out. Since a relation is a set, any duplicate rows are eliminated. Projection is

denoted by the Greek letter pi (π). The attributes that we wish to appear in the result are listed as a subscript to π. The argument relation follows in parentheses. For example, the query to list all loan numbers and the amount of the loan is written as:

π loan-number, amount (loan)

The result of the query is the following:

loan-number amount
L-17 1000
L-23 2000
L-15 1500
L-14 1500
L-93 500
L-11 900
L-16 1300
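The SQL counterpart of this projection could be written as below; DISTINCT mirrors the fact that a relation is a set and duplicate rows are eliminated (the underscored column names are an assumption):

SELECT DISTINCT loan_number, amount
FROM loan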

Another, more complicated, example query is to find those customers who live in Harrison; it is written as:

π customer-name (σ customer-city = "Harrison" (customer))

The union operation yields the results that appear in either or both of two relations. It is a binary

operation denoted by the symbol ∪.

An example query would be to find the name of all bank customers who have either an account

or a loan or both. To find this result we will need the information in the depositor relation and in

the borrower relation. To find the names of all customers with a loan in the bank we would write:

π customer-name (borrower)

and to find the names of all customers with an account in the bank, we would write:

π customer-name (depositor)

Then by using the union operation on these two queries we have the query we need to obtain the

wanted results. The final query is written as:

π customer-name (borrower) ∪ π customer-name (depositor)

The result of the query is the following:

customer-name

Johnson

Smith

Hayes

Turner

Jones

Lindsay

Jackson

Curry

Williams

Adams

The set intersection operation is denoted by the symbol ∩. It is not a fundamental operation; rather, it is a more convenient way to write r − (r − s).


An example query using the operation to find all customers who have both a loan and an account can be written as:

π customer-name (borrower) ∩ π customer-name (depositor)

The results of the query are the following:

customer-name

Hayes

Jones

Smith
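In SQL, the union and intersection queries above have direct counterparts in the UNION and INTERSECT operators (assuming tables borrower and depositor with a customer_name column):

-- customers with a loan, an account, or both
SELECT customer_name FROM borrower
UNION
SELECT customer_name FROM depositor

-- customers with both a loan and an account
SELECT customer_name FROM borrower
INTERSECT
SELECT customer_name FROM depositor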

Set Difference Operation: Set difference is denoted by the minus sign (−). It finds tuples that are in one relation but not in another. Thus r − s results in a relation containing tuples that are in r but not in s.

Cartesian Product Operation: The Cartesian product of two relations is denoted by a cross (×) and written r × s. The result of r × s is a new relation with a tuple for each possible pairing of tuples from r and s.


Chapter-2

RELATIONAL DATABASE MODEL
End Chapter quizzes:

Q1. Which of the following are characteristics of an RDBMS?
a) Data are organized in a series of two-dimensional tables, each of which contains records for one entity.
b) Queries are possible on individual or groups of tables.
c) It cannot use SQL.
d) Tables are linked by common data known as keys.

Q2. The keys that can have NULL values are
a) Primary Key
b) Unique Key
c) Foreign Key
d) Both b and c

Q3. GRANT and REVOKE are
(a) DDL statements
(b) DML statements
(c) DCL statements
(d) None of these.

Q4. Rows of a relation are called
(a) tuples

(b) a relation row

(c) a data structure (d) an entity

Q5. Primary Key column in the Table
(a) Can’t accept NULL values

(b) Can’t accept duplicate values

(c) Can’t be more than one

(d) All of the above

Q6. How many primary keys can a table have?
A). Any number

B). 1

C). 255

D). None of the above

Q7. Projection operation is:
a) Unary operation

b) Ternary operation

c) binary operation

d) None of the above

Q8. The keys that can have NULL values are
A). Primary Key

B). Unique Key

C). Foreign Key


D). Both b and c

Q9. Referential integrity constraint is specified between two relations

a) True

b) False

Q10 Union operation in relational algebra is performed on

a) Single Relation

b) Two relations

c) Both a and b

d) None

Q11. As per Codd’s rules, a NULL value is the same as

a) blank space

b) Zero

c) Character string

d) None of the above.

Q12 Relational Algebra is a non procedural query language

a) True

b) False


Chapter: 3

FUNCTIONAL DEPENDENCY AND NORMALIZATION

1. Functional Dependency

Consider a relation R that has two attributes A and B. The attribute B of the relation is

functionally dependent on the attribute A if and only if for each value of A no more than one

value of B is associated. In other words, the value of attribute A uniquely determines the value of

B and if there were several tuples that had the same value of A then all these tuples will have an

identical value of attribute B. That is, if t1 and t2 are two tuples in the relation R and t1(A) =

t2(A) then we must have t1(B) = t2(B).

A and B need not be single attributes. They could be any subsets of the attributes of a relation R

(possibly single attributes). We may then write

R.A -> R.B

if B is functionally dependent on A (or A functionally determines B). Note that functional

dependency does not imply a one-to-one relationship between A and B although a one-to-one

relationship may exist between A and B.

A simple example of the above functional dependency is when A is a primary key of an entity

(e.g. student number) and B is some single-valued property or attribute of the entity (e.g. date of

birth). A -> B then must always hold.

Functional dependencies also arise in relationships. Let C be the primary key of an entity and D

be the primary key of another entity. Let the two entities have a relationship. If the relationship is

one-to-one, we must have C -> D and D -> C. If the relationship is many-to-one, we would have

C -> D but not D -> C. For many-to-many relationships, no functional dependencies hold. For

example, if C is student number and D is subject number, there is no functional dependency

between them. If however, we were storing marks and grades in the database as well, we would

have


(student_number, subject_number) -> marks and we might have

marks -> grades

The second functional dependency above assumes that the grades are dependent only on the

marks. This may sometime not be true since the instructor may decide to take other

considerations into account in assigning grades, for example, the class average mark.

For example, in the student database that we have discussed earlier, we have the following

functional dependencies:

sno -> sname
sno -> address
cno -> cname
cno -> instructor

instructor -> office

These functional dependencies imply that there can be only one name for each sno, only one

address for each student and only one subject name for each cno. It is of course possible that

several students may have the same name and several students may live at the same address. If

we consider cno -> instructor, the dependency implies that no subject can have more than one

instructor (perhaps this is not a very realistic assumption). Functional dependencies therefore

place constraints on what information the database may store. In the above example, one may be

wondering if the following FDs hold

sname -> sno
cname -> cno

Certainly there is nothing in the instance of the example database presented above that

contradicts the above functional dependencies. However, whether above FDs hold or not would

depend on whether the university or college whose database we are considering allows duplicate

student names and subject names. If it was the enterprise policy to have unique subject names

then cname -> cno holds. If duplicate student names are possible, and one would think there

always is the possibility of two students having exactly the same name, then sname -> sno does

not hold.

Functional dependencies arise from the nature of the real world that the database models. Often

A and B are facts about an entity where A might be some identifier for the entity and B some

characteristic. Functional dependencies cannot be automatically determined by studying one or

more instances of a database. They can be determined only by a careful study of the real world

and a clear understanding of what each attribute means.


We have noted above that the definition of functional dependency does not require that A and B

be single attributes. In fact, A and B may be collections of attributes. For example

(sno, cno) -> (mark, date)

When dealing with a collection of attributes, the concept of full functional dependence is an

important one. Let A and B be distinct collections of attributes from a relation R and let R.A ->

R.B. B is then

fully functionally dependent on A if B is not functionally dependent on any proper subset of A. The

above example of students and subjects would show full functional dependence if mark and date

are not functionally dependent on either student number ( sno) or subject number ( cno) alone.

This implies that we are assuming that a student may take more than one subject and a subject

would be taken by many different students. Furthermore, it has been assumed that there is at

most one enrolment of each student in the same subject.

The above example illustrates full functional dependence. However the following dependence

(sno, cno) -> instructor is not full functional dependence because cno -> instructor holds.

As noted earlier, the concept of functional dependency is related to the concept of candidate key

of a relation since a candidate key of a relation is an identifier which uniquely identifies a tuple

and therefore determines the values of all other attributes in the relation. Therefore, if a subset X of the attributes of a relation R satisfies the property that all remaining attributes of the relation are functionally dependent on it (that is, on X), then X is a candidate key, as long as no attribute can be removed from X with the property of functional dependence still holding. In the

example above, the attributes (sno, cno) form a candidate key (and the only one) since they

functionally determine all the remaining attributes.

Functional dependence is an important concept and a large body of formal theory has been

developed about it. We discuss the concept of closure that helps us derive all functional

dependencies that are implied by a given set of dependencies. Once a complete set of functional

dependencies has been obtained, we will study how these may be used to build normalised

relations.

Rules about Functional Dependencies

Let F be a set of FDs specified on R.
o We must be able to reason about the FDs in F.
o The schema designer usually explicitly states only those FDs which are obvious.
o Without knowing exactly what all the tuples are, we must be able to deduce the other FDs that hold on R.
o This is essential when we discuss the design of “good” relational schemas.

Design of Relational Database Schemas

Problems such as redundancy that occur when we try to cram too much into a single relation are

called anomalies. The principal kinds of anomalies that we encounter are:

o Redundancy. Information may be repeated unnecessarily in several tuples.
o Update Anomalies. We may change information in one tuple but leave the same information unchanged in another.
o Deletion Anomalies. If a set of values becomes empty, we may lose other information as a side effect.

2. Normalization

When designing a database, a data model is usually translated into a relational schema. The important question is whether there is a design methodology or whether the process is arbitrary. A simple answer to this question is affirmative. There are certain properties that a good database design must possess, as dictated by Codd’s rules. There are many different ways of designing a good database. One such methodology is the method involving ‘Normalization’. Normalization theory is built

around the concept of normal forms. Normalization reduces redundancy. Redundancy is

unnecessary repetition of data. It can cause problems with storage and retrieval of data. During

the process of normalization, dependencies can be identified, which can cause problems during

deletion and updation. Normalization theory is based on the fundamental notion of Dependency.

Normalization helps in simplifying the structure of schema and tables.

To exemplify the normal forms, we will take an example of a database with the following logical design:
Relation S {S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key {S#}

Relation P { P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary

Key{P#}

Relation SP { S#, SUPPLYCITY, P#, PARTQTY}, Primary Key{S#, P#}


Foreign Key{S#} Reference S

Foreign Key{P#} Reference P

S# SUPPLYCITY P# PARTQTY
S1 Bombay P1 3000
S1 Bombay P2 2000
S1 Bombay P3 4000
S1 Bombay P4 2000
S1 Bombay P5 1000
S1 Bombay P6 1000
S2 Mumbai P1 3000
S2 Mumbai P2 4000
S3 Mumbai P2 2000
S4 Madras P2 2000
S4 Madras P4 3000
S4 Madras P5 4000

Let us examine the table above to find any design discrepancy. A quick glance reveals that some

of the data are being repeated. That is data redundancy, which is of course undesirable. The

fact that a particular supplier is located in a city has been repeated many times. This redundancy

causes many other related problems. For instance, after an update a supplier may be displayed to

be from Madras in one entry while from Mumbai in another. This further gives rise to many

other problems.

Therefore, for the above reasons, the tables need to be refined. This process of refinement of a

given schema into another schema or a set of schemas possessing the qualities of a good database is known as Normalization. Database experts have defined a series of normal forms, each conforming to some specified design criteria.

Decomposition. Decomposition is the process of splitting a relation into two or more relations.

This is nothing but the projection process. Decompositions may or may not lose information. As you will learn shortly, the normalization process involves breaking a given relation into one or more relations, and these decompositions should be reversible as well, so that no information is lost in the process. Thus, we will be more interested in the decompositions that incur no loss of information than in the ones in which information is lost.

Lossless decomposition: A decomposition which results in relations without losing any information is known as a lossless (or nonloss) decomposition. A decomposition that results in loss of information is known as a lossy decomposition.


Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as

shown below.

S S# SUPPLYSTATUS SUPPLYCITY

S3 100 Madras

S5 100 Mumbai

Let us decompose this table into two as shown below:

(1) SX: S# SUPPLYSTATUS        SY: S# SUPPLYCITY
        S3 100                      S3 Madras
        S5 100                      S5 Mumbai

(2) SX: S# SUPPLYSTATUS        SY: SUPPLYSTATUS SUPPLYCITY
        S3 100                      100 Madras
        S5 100                      100 Mumbai

Let us examine these decompositions. In decomposition (1) no information is lost. We can still

say that S3’s status is 100 and location is Madras and also that supplier S5 has 100 as its status

and location Mumbai. This decomposition is therefore lossless.

In decomposition (2), however, we can still say that status of both S3 and S5 is 100. But the

location of suppliers cannot be determined by these two tables. The information regarding the

location of the suppliers has been lost in this case. This is a lossy decomposition. Certainly,

lossless decomposition is more desirable because otherwise the decomposition will be

irreversible. The decomposition process is in fact projection, where some attributes are selected

from a table. A natural question arises here as to why the first decomposition is lossless while the

second one is lossy. How should a given relation be decomposed so that the resulting projections are nonlossy? The answer to these questions lies in functional dependencies and may be

given by the following theorem.

Heath’s theorem: Let R {A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies the FD A -> B, then R is equal to the join of its projections on {A, B} and {A, C}.


Let us apply this theorem to the decompositions described above. We observe that relation S satisfies two irreducible FDs:
S# -> SUPPLYSTATUS
S# -> SUPPLYCITY
Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, this theorem confirms that relation S can be nonloss-decomposed into its projections on {S#, SUPPLYSTATUS} and {S#, SUPPLYCITY}. Note, however, that the theorem does not say why the projections {S#, SUPPLYSTATUS} and {SUPPLYSTATUS, SUPPLYCITY} should be lossy; yet we can see that in that decomposition the location information is lost.

An alternative criterion for lossless decomposition is as follows. Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
R1 ∩ R2 -> R1
R1 ∩ R2 -> R2

2.1 First Normal Form

A relation is in 1st Normal form (1NF) if and only if, in every legal value of that relation, every

tuple contains exactly one value for each attribute.

Although simplest, 1NF relations have a number of discrepancies, and therefore 1NF is not the most desirable form of a relation.

Let us take a relation (modified to illustrate the point in discussion) as

Rel1 {S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY}

Primary Key{S#, P#}

FD {SUPPLYCITY -> SUPPLYSTATUS}

Note that SUPPLYSTATUS is functionally dependent on SUPPLYCITY, meaning that a supplier’s status is determined by the location of that supplier – e.g. all suppliers from Madras must have the same status (200 in the sample data below). The primary key of the relation Rel1 is {S#, P#}.


Let us discuss some of the problems with this 1NF relation. For the purpose of

illustration, let us insert some sample tuples into this relation

REL1 S# SUPPLYSTATUS SUPPLYCITY P# PARTQTY

S1 200 Madras P1 3000

S1 200 Madras P2 2000

S1 200 Madras P3 4000

S1 200 Madras P4 2000

S1 200 Madras P5 1000

S1 200 Madras P6 1000

S2 100 Mumbai P1 3000

S2 100 Mumbai P2 4000

S3 100 Mumbai P2 2000

S4 200 Madras P2 2000

S4 200 Madras P4 3000

S4 200 Madras P5 4000

The redundancies in the above relation cause many problems – usually known as update anomalies – in INSERT, DELETE and UPDATE operations. Let us see these problems one by one.

INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the

information regarding a supplier. Thus, a supplier located in Kolkata is missing from the relation

because he has not supplied any part so far.

DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the tuple of a supplier (when there is a single entry for that supplier), we not only delete the fact that the supplier supplied a particular part but also the fact that the supplier is located in a particular city. In our case, if we delete the entries corresponding to S#=S2, we lose the information that the supplier is located at Mumbai. This is definitely undesirable. The problem here is that too much information is attached to each tuple; therefore deletion forces us to lose too much information.

UPDATE: If we modify the city of a supplier S1 to Mumbai from Madras, we have to make sure

that all the entries corresponding to S#=S1 are updated otherwise inconsistency will be

introduced. As a result some entries will suggest that the supplier is located at Madras while

others will contradict this fact.


2.2 Second Normal Form

A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally

dependent on the primary key. Here it has been assumed that there is only one candidate key,

which is of course primary key.

A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction

process consists of replacing the 1NF relation by suitable projections.

We have seen the problems arising due to the relation being only in 1NF. The

remedy is to break the relation into two simpler relations.

REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and

REL3{S#, P#, PARTQTY}

REL2 and REL3 are in 2NF, with primary keys {S#} and {S#, P#} respectively. This is because each nonkey attribute of REL2 ({SUPPLYSTATUS, SUPPLYCITY}) is fully functionally dependent on the primary key S#. By a similar argument, REL3 is also in 2NF. Evidently, these two relations have overcome all the update anomalies stated earlier. Now it is possible to insert the facts regarding supplier S5 even when he has not supplied any part, which was earlier not possible. This solves the insert problem. Similarly, the delete and update problems are also over now.

These relations in 2NF are still not free from all the anomalies. REL3 is free from most of the problems we are going to discuss here; however, REL2 still carries some problems. The reason is that the dependency of SUPPLYSTATUS on S#, though functional, is transitive via SUPPLYCITY (S# -> SUPPLYCITY and SUPPLYCITY -> SUPPLYSTATUS). We will see that this transitive dependency gives rise to another set of anomalies.

INSERT: We are unable to insert the fact that a particular city has a particular status until we

have some supplier actually located in that city.

DELETE: If we delete the sole REL2 tuple for a particular city, we delete the information that that

city has that particular status.

UPDATE: The status for a given city still has redundancy. This causes usual redundancy

problem related to update.


2.3 Third Normal Form
A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively

dependent on the primary key.

To convert the 2NF relation into 3NF, once again, the REL2 is split into two simpler relations –

REL4 and REL5 as shown below.

REL4 {S#, SUPPLYCITY}
and
REL5 {SUPPLYCITY, SUPPLYSTATUS}
Sample relations are shown below.

REL4
S# SUPPLYCITY
S1 Madras
S2 Mumbai
S3 Mumbai
S4 Madras
S5 Kolkata

REL5
SUPPLYCITY SUPPLYSTATUS
Madras 200
Mumbai 100
Kolkata 300

Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive dependencies. Every 2NF relation can be reduced to 3NF by decomposing it further and removing any transitive dependency.

2.4 Boyce-Codd Normal Form

The previous normal forms assumed that there was just one candidate key in the relation and that

key was also the primary key. Another class of problems arises when this is not the case. Very

often there will be more than one candidate key in practical database design situations. To be precise, 1NF, 2NF and 3NF did not deal adequately with the case of relations that had two or more candidate keys, where the candidate keys were composite and overlapped (i.e. had at least one attribute in common).

A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-

irreducible FD has a candidate key as its determinant. Or

A relation is in BCNF if and only if all the determinants are candidate keys.

It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition,

in that it makes no explicit reference to first and second normal forms as such, nor to the concept

of transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the


case that any given relation can be nonloss decomposed into an equivalent collection of BCNF

relations. Thus, relations REL1 and REL2, which were not in 3NF, are not in BCNF either, while relations REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1

contains three determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#,

P#} is a candidate key, so REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because

the determinant {SUPPLYCITY} is not a candidate key. Relations REL3, REL4, and REL5,

on the other hand, are each in BCNF, because in each case the sole candidate key is the only

determinant in the respective relations.

2.5 Comparison of BCNF and 3NF

We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an

advantage to 3NF in that we know that it is always possible to obtain a 3NF design without

sacrificing a lossless join or dependency preservation. Nevertheless, there is a disadvantage to

3NF. If we do not eliminate all transitive dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and we face the problem of repetition of information.

If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally

preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either

pay a high penalty in system performance or risk the integrity of the data in our database. Neither

of these alternatives is attractive.

With such alternatives, the limited amount of redundancy imposed by transitive dependencies

allowed under 3NF is the lesser evil.

Thus, we normally choose to retain dependency preservation and to sacrifice BCNF.

2.6 Multi-valued dependency

Multi-valued dependency may be formally defined as:

Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is

multi-dependent on A - in symbols,

A ->-> B
(read “A multi-determines B,” or simply “A double-arrow B”) - if and only if, in every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the

A value and is independent of the C value.


2.7 Fifth Normal Form

It seems that the sole operation necessary or available in the further normalization process is the

replacement of a relation in a nonloss way by exactly two of its projections. This assumption has

successfully carried us as far as 4NF. It comes perhaps as a surprise, therefore, to discover that

there exist relations that cannot be nonloss-decomposed into two projections but can be nonloss-

decomposed into three (or more). Using an unpleasant but convenient term, we will describe such a

relation as "n-decomposable" (for some n > 2) - meaning that the relation in question can be

nonloss-decomposed into n projections but not into m for any m < n.

A relation that can be nonloss-decomposed into two projections we will call “2-decomposable”; the term “n-decomposable” may be defined similarly.

2.8 Join Dependency
Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R satisfies the join dependency (JD)
*{A, B, ..., Z}
(read “star A, B, ..., Z”) if and only if every possible legal value of R is equal to the join of its projections on A, B, ..., Z.

Fifth normal form: A relation R is in 5NF - also called projection-join normal form (PJ/NF) - if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R. Let us understand what it means for a JD to be “implied by candidate keys.”

Relation REL12 is not in 5NF: it satisfies a certain join dependency, namely Constraint 3D, that

is certainly not implied by its sole candidate key (that key being the combination of all of its

attributes).

Now let us understand through an example, what it means for a JD to be implied by candidate

keys. Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and

{SUPPLIERNAME}. Then that relation satisfies several join dependencies - for example, it

satisfies the JD

*{ { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }


That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME,

SUPPLYSTATUS} and {S#, SUPPLYCITY}, and hence can be nonloss-decomposed into those

projections. (This fact does not mean that it should be so decomposed, of course, only that it

could be.) This JD is implied by the fact that {S#} is a candidate key (in fact it is implied by

Heath's theorem). Likewise, relation REL1 also satisfies the JD

* {{S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY}}

This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.

To conclude, we note that it follows from the definition that 5NF is the ultimate normal form

with respect to projection and join (which accounts for its alternative name, projection-join

normal form). That is, a relation in 5NF is guaranteed to be free of anomalies that can be

eliminated by taking projections. For a relation in 5NF, the only join dependencies are those

that are implied by candidate keys, and so the only valid decompositions are ones that are based

on those candidate keys.


Chapter-3

FUNCTIONAL DEPENDENCY AND NORMALIZATION
End Chapter quizzes:

Q1 Normalization is a step-by-step process of decomposing:
(a) Table
(b) Database
(c) Group Data item
(d) All of the above

Q2. A relation is said to be in 2 NF if
(i) it is in 1 NF

(ii) non-key attributes dependent on key attribute

(iii) non-key attributes are independent of one another

(iv) if it has a composite key, no non-key attribute should be dependent on

part of the composite key.

(a) i, ii, iii (b) i and ii

(c) i, ii, iv (d) i, iv

Q3. A relation is said to be in 3 NF if
(i) it is in 2 NF

(ii) non-key attributes are independent of one another

(iii) key attribute is not dependent on part of a composite key

(iv) has no multi-valued dependency

(a) i and iii (b) i and iv

(c) i and ii (d) ii and iv

Q4. A relation is said to be in BCNF when
(a) it has overlapping composite keys

(b) it has no composite keys

(c) it has no multivalued dependencies

(d) it has no overlapping composite keys which have related attributes

Q5. Fourth normal form (4 NF) relations are needed when
(a) there are multivalued dependencies between attributes in composite key

(b) there are more than one composite key

(c) there are two or more overlapping composite keys

(d) there are multivalued dependency between non-key attributes


Q6. A good database design
(i) is expandable with growth and changes in organization

(ii) easy to change when software changes

(iii) ensures data integrity

(iv) allows access to only authorized users

(a) i, ii (b) ii, iii

(c) i, ii, iii, iv (d) i, ii, iii

Q7. Given an attribute x, another attribute y is dependent on it, if for a given x
(a) there are many y values

(b) there is only one value of y

(c) there is one or more y values

(d) there is none or one y value

Q8. If a non-key attribute depends on another non-key attribute, it is known as

a) Full F D

b) Partial F D

c) TRANSITIVE F D

d) None of the above

Q9. Decomposition of relation should always be

a) Lossy

b) Lossless

c) Both a and b

d) None of the above


Chapter: 4

STRUCTURED QUERY LANGUAGE

1. INTRODUCTORY CONCEPTS

1.1 What is SQL?

SQL stands for Structured Query Language

SQL allows you to access a database

SQL is an ANSI standard computer language

SQL can execute queries against a database

SQL can retrieve data from a database

SQL can insert new records in a database

SQL can delete records from a database

SQL can update records in a database

SQL is easy to learn

SQL is an ANSI (American National Standards Institute) standard computer language for

accessing and manipulating database systems. SQL statements are used to retrieve and update

data in a database. SQL works with database programs like MS Access, DB2, Informix, MS

SQL Server, Oracle, Sybase, etc

1.2 SQL Database Tables:

A database most often contains one or more tables. Each table is identified by a name (e.g.

"Customers" or "Orders"). Tables contain records (rows) with data.

Below is an example of a table called "Persons":

LastName FirstName Address City

Hansen Ola Timoteivn 10 Sandnes

Svendson Tove Borgvn 23 Sandnes

Pettersen Kari Storgt 20 Stavanger


The table above contains three records (one for each person) and four columns (LastName,

FirstName, Address, and City).

2. DATABASE LANGUAGE

2.1 SQL Data Definition Language (DDL)

The Data Definition Language (DDL) part of SQL permits database tables to be created or

deleted. We can also define indexes (keys), specify links between tables, and impose constraints

between database tables.

The most important DDL statements in SQL are:

CREATE TABLE - creates a new database table

ALTER TABLE - alters (changes) a database table

DROP TABLE - deletes a database table

Create a Table

To create a table in a database:

CREATE TABLE table_name

(

column_name1 data_type,

column_name2 data_type,

.......

)

Example

This example demonstrates how you can create a table named "Person", with four columns. The

column names will be "LastName", "FirstName", "Address", and "Age":
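A minimal sketch of that statement (the data types are assumptions, since the text does not specify them):

CREATE TABLE Person
(
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
Age int
)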

ALTER TABLE

The ALTER TABLE statement is used to add, drop and modify columns in an existing table.

ALTER TABLE table_name


ADD column_name datatype

ALTER TABLE table_name

MODIFY column_name datatype

ALTER TABLE table_name

DROP COLUMN column_name
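For example, a hypothetical Birthday column could first be added to and then removed from the Person table:

ALTER TABLE Person
ADD Birthday date

ALTER TABLE Person
DROP COLUMN Birthday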

Delete a Table or Database

To delete a table (the table structure, attributes, and indexes will also be deleted):

DROP TABLE table_name

2.2 SQL Data Manipulation Language (DML)

DML language includes syntax to update, insert, and delete records. These query and update

commands together form the Data Manipulation Language (DML) part of SQL:

UPDATE - updates data in a database table

DELETE - deletes data from a database table

INSERT INTO - inserts new data into a database table

The INSERT INTO Statement

The INSERT INTO statement is used to insert new rows into a table.

Syntax

INSERT INTO table_name

VALUES (value1, value2,....)

You can also specify the columns for which you want to insert data:


INSERT INTO table_name (column1, column2,...)

VALUES (value1, value2,....)
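For example, a new row could be added to the "Persons" table shown earlier (the values are illustrative):

INSERT INTO Persons (LastName, FirstName, Address, City)
VALUES ('Nilsen', 'Johan', 'Bakken 2', 'Stavanger')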

The Update Statement

The UPDATE statement is used to modify the data in a table.

Syntax

UPDATE table_name

SET column_name = new_value

WHERE column_name = some_value

The DELETE Statement

The DELETE statement is used to delete rows in a table.

Syntax

DELETE FROM table_name

WHERE column_name = some_value
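For example, against the "Persons" table (the names and values are illustrative):

UPDATE Persons
SET Address = 'Nissestien 67'
WHERE LastName = 'Pettersen'

DELETE FROM Persons
WHERE LastName = 'Pettersen'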

2.3 SQL Data Query Language (DQL)

It is used to retrieve the existing data from the database, using select statements.

SQL SELECT Example

To select the content of columns named "LastName" and "FirstName", from the database table

called "Persons", use a SELECT statement like this:

SELECT LastName, FirstName

FROM Persons


The WHERE Clause

To conditionally select data from a table, a WHERE clause can be added to the SELECT

statement.

Syntax

SELECT column FROM table

WHERE column operator value

With the WHERE clause, the following operators can be used:

Operator Description

= Equal

<> Not equal

> Greater than

< Less than

>= Greater than or equal

<= Less than or equal

BETWEEN Between an inclusive range

LIKE Search for a pattern

Using the WHERE Clause

To select only the persons living in the city "Sandnes", we add a WHERE clause to the SELECT

statement:

SELECT * FROM Persons

WHERE City='Sandnes'


"Persons" table

LastName FirstName Address City Year

Hansen Ola Timoteivn 10 Sandnes 1951

Svendson Tove Borgvn 23 Sandnes 1978

Svendson Stale Kaivn 18 Sandnes 1980

Pettersen Kari Storgt 20 Stavanger 1960

Result

LastName FirstName Address City Year

Hansen Ola Timoteivn 10 Sandnes 1951

Svendson Tove Borgvn 23 Sandnes 1978

Svendson Stale Kaivn 18 Sandnes 1980

The LIKE Condition

The LIKE condition is used to specify a search for a pattern in a column.

Syntax

SELECT column

FROM table

WHERE column LIKE pattern

A "%" sign can be used to define wildcards (missing letters in the pattern) both before and after

the pattern.

Using LIKE

The following SQL statement will return persons with first names that start with an 'O':

SELECT *

FROM Persons

WHERE FirstName LIKE 'O%'


The ORDER BY keyword is used to sort the result.

Sort the Rows

The ORDER BY clause is used to sort the rows.

Orders:

Company OrderNumber

Sega 3412

ABC Shop 5678

W3Schools 2312

W3Schools 6798

Example

To display the companies in alphabetical order:

SELECT Company, OrderNumber FROM Orders

ORDER BY Company

Result:

Company OrderNumber

ABC Shop 5678

Sega 3412

W3Schools 6798

W3Schools 2312

Example

To display the companies in alphabetical order AND the order numbers in numerical order:


SELECT Company, OrderNumber FROM Orders

ORDER BY Company, OrderNumber

Result:

Company OrderNumber

ABC Shop 5678

Sega 3412

W3Schools 2312

W3Schools 6798

GROUP BY...

Aggregate functions (like SUM) often need an added GROUP BY functionality.

GROUP BY... was added to SQL because aggregate functions (like SUM) return the aggregate

of all column values every time they are called, and without the GROUP BY function it was

impossible to find the sum for each individual group of column values.

The syntax for the GROUP BY function is:

SELECT column, SUM(column)

FROM table

GROUP BY column

GROUP BY Example

This "Sales" Table:

Company Amount

W3Schools 5500

IBM 4500

W3Schools 7100
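Applying the GROUP BY syntax to this table, a query that sums the Amount for each company, and the result it would produce over the data above:

SELECT Company, SUM(Amount)
FROM Sales
GROUP BY Company

Result:

Company SUM(Amount)
W3Schools 12600
IBM 4500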


3. What is a View?

In SQL, a VIEW is a virtual table based on the result-set of a SELECT statement.

A view contains rows and columns, just like a real table. The fields in a view are fields from one

or more real tables in the database. You can add SQL functions, WHERE, and JOIN statements

to a view and present the data as if the data were coming from a single table.

Syntax

CREATE VIEW view_name AS

SELECT column_name(s)

FROM table_name

WHERE condition
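For example, a view over the "Persons" table shown earlier could be created like this (the view name is illustrative):

CREATE VIEW PersonsFromSandnes AS
SELECT LastName, FirstName
FROM Persons
WHERE City = 'Sandnes'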

Views are of two types: updateable and non-updateable. Through an updateable view the values of the base table can be modified, whereas through a non-updateable view the base table cannot be updated.

4. Renaming a Table

ALTER TABLE <table>

RENAME <oldname> TO <newname>;

RENAME TABLE student TO student_new

This SQL command will rename the student table to student_new

5. Renaming a SQL View
This command renames a SQL view in the current database.

RENAME VIEW ViewName1 TO ViewName2

Parameters

ViewName1

Specifies the name of the SQL view to be renamed.

ViewName2

Specifies the new name of the SQL view.

6. Renaming Columns & Constraints


In addition to renaming tables and indexes, Oracle9i Release 2 allows the renaming of columns and constraints on tables. In this example, once the TEST1 table is created it is renamed along with its columns, primary key constraint and the index that supports the primary key:

SQL> CREATE TABLE test1

(

2 col1 NUMBER(10) NOT NULL,

3 col2 VARCHAR2(50) NOT NULL );

Table created.

SQL> ALTER TABLE test1

ADD (

2 CONSTRAINT test1_pk PRIMARY KEY (col1));

Table altered.

SQL> DESC test1

Name Null? Type

-------------------- -------- --------------------

COL1 NOT NULL NUMBER(10)

COL2 NOT NULL VARCHAR2(50)

SQL> SELECT constraint_name

2 FROM user_constraints

3 WHERE table_name = 'TEST1'

4 AND constraint_type = 'P';

CONSTRAINT_NAME

------------------------------

TEST1_PK

1 row selected.

SQL> SELECT index_name, column_name

2 FROM user_ind_columns

3 WHERE table_name = 'TEST1';


INDEX_NAME COLUMN_NAME

-------------------- --------------------

TEST1_PK COL1

1 row selected.

SQL> -- Rename the table, columns, primary key

SQL> -- and supporting index.

SQL> ALTER TABLE test1 RENAME TO test;

Table altered.

SQL> ALTER TABLE test RENAME COLUMN col1 TO id;

Table altered.

SQL> ALTER TABLE test RENAME COLUMN col2 TO description;

Table altered.

SQL> ALTER TABLE test RENAME CONSTRAINT test1_pk TO test_pk;

Table altered.

SQL> ALTER INDEX test1_pk RENAME TO test_pk;

Index altered.

SQL> DESC test

Name Null? Type

-------------------- -------- --------------------

ID NOT NULL NUMBER(10)

DESCRIPTION NOT NULL VARCHAR2(50)

SQL> SELECT constraint_name

2 FROM user_constraints

3 WHERE table_name = 'TEST'

4 AND constraint_type = 'P';

CONSTRAINT_NAME

--------------------

TEST_PK


1 row selected.

SQL> SELECT index_name, column_name

2 FROM user_ind_columns

3 WHERE table_name = 'TEST';

INDEX_NAME COLUMN_NAME

-------------------- --------------------

TEST_PK ID

1 row selected.


STRUCTURED QUERY LANGUAGE
End Chapter quizzes:
Q1 SELECT statement is used for

a) Updating data in the database

b) Retrieving data from the database

c) Change in the structure of database

d) None of the above

Q2. Select the correct statement

a) ALTER statement is used to modify the structure of Database.

b) Update statement is used to change the data into the table.

c) SELECT statement is used to retrieve the data from the database

d) All of the above.

Q3. Which of the following statements are NOT TRUE about ORDER BY clauses?

A. Ascending or descending order can be defined with the asc or desc keywords.

B. Only one column can be used to define the sort order in an order by clause.

C. Multiple columns can be used to define sort order in an order by clause.

D. Columns can be represented by numbers indicating their listed order in the select

Q4 GRANT and REVOKE are

(a) DDL statements

(b) DML statements

(c) DCL statements

(d) None of these.

Q5. Oracle 8i can be best described as

(a) Object-based DBMS

(b) Object-oriented DBMS

(c) Object-relational DBMS

(d) Relational DBMS

Q6. Select the correct statement.

a) View has no physical existence.

b) Data from the view are retrieved through the Table.

c) Both (a) and (b)

d) None of these.

Q7 INSERT statement is used to

a) Storing data into the Table


b) Deleting data from the Table

c) Both a and b

d) Updating data in the table

Q8 ALTER statement is used to

A) Changing structure of the table

B ) Changing data from the Table

C ) Both a and b

D ) Deleting data from the table

Q9. RENAME TABLE student TO student_new

a) Rename the column of the Table

b) Change the Table name student to student_new

c) Rename the row of the table

d) None of the above.

Q10. ORDER BY clause is used to

a) Sort the row of the table in a particular order

b) Remove the column of the table

c) Rename the Table

d) Both a and c


Chapter: 5

PROCEDURAL QUERY LANGUAGE

1. Introduction to PL/SQL

PL/SQL is a procedural extension for Oracle’s Structured Query Language. PL/SQL is not a

separate language but rather a technology; that is to say, you will not have a separate place or

prompt for executing your PL/SQL programs. PL/SQL technology is like an engine that executes

PL/SQL blocks and subprograms. This engine can be started in Oracle server or in application

development tools such as Oracle Forms, Oracle Reports etc.

The PL/SQL engine executes procedural statements and sends the SQL part of the statements to the SQL statement processor in the Oracle server. PL/SQL combines the data

manipulating power of SQL with the data processing power of procedural languages.


2 Block Structure of PL/SQL:

PL/SQL is a block-structured language. It means that Programs of PL/SQL contain logical

blocks. PL/SQL block consists of SQL and PL/SQL statements.

A PL/SQL Block consists of three sections:

The Declaration section (optional).

The Execution section (mandatory).

The Exception (or Error) Handling section (optional).

2.1 Declaration Section:

The Declaration section of a PL/SQL Block starts with the reserved keyword DECLARE. This

section is optional and is used to declare any placeholders like variables, constants, records and

cursors, which are used to manipulate data in the execution section. Placeholders may be any of

Variables, Constants and Records, which stores data temporarily. Cursors are also declared in

this section.

Declaring Variables: Variables are declared in DECLARE section of PL/SQL.

DECLARE

SNO NUMBER (3);

SNAME VARCHAR2 (15);

2.2 Execution Section:

The Execution section of a PL/SQL Block starts with the reserved keyword BEGIN and ends

with END. This is a mandatory section and is the section where the program logic is written to

perform any task. The programmatic constructs like loops, conditional statement and SQL

statements form the part of execution section.

2.3 Exception Section:

The Exception section of a PL/SQL Block starts with the reserved keyword EXCEPTION. This

section is optional. Any errors in the program can be handled in this section, so that the PL/SQL

Block terminates gracefully. If the PL/SQL Block contains exceptions that cannot be handled,


the Block terminates abruptly with errors. Every statement in the above three sections must end

with a semicolon (;). PL/SQL blocks can be nested within other PL/SQL blocks. Comments can

be used to document code.

3. How a sample PL/SQL Block looks.

DECLARE
   Variable declaration
BEGIN
   Program execution
EXCEPTION
   Exception handling
END;

Variables and Constants: Variables are used to store query results. Forward references are not

allowed. Hence you must first declare the variable and then use it.

Variables can have any SQL data type, such as CHAR, DATE, NUMBER etc or any PL/SQL

data type like BOOLEAN, BINARY_INTEGER etc.

Declaring Variables: Variables are declared in DECLARE section of PL/SQL.

DECLARE

SNO NUMBER (3);

SNAME VARCHAR2 (15);

BEGIN

Assigning values to variables:

SNO NUMBER := 1001; (an initial value given at declaration)
or, in the execution section,
SNAME := ‘JOHN’; etc.

The following example shows how to write a simple PL/SQL program and execute it.
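A minimal sketch of such a program (the variable value and message text are illustrative):

SET SERVEROUTPUT ON

DECLARE
   SNAME VARCHAR2 (15) := 'JOHN';
BEGIN
   DBMS_OUTPUT.PUT_LINE ('Welcome ' || SNAME);
END;
/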

SET SERVEROUTPUT ON is a command used to access results from Oracle Server.

A PL/SQL program is terminated by a “/ “. DBMS_OUTPUT is a package and PUT_LINE is a

procedure in it.

You will learn more about procedures, functions and packages in the following sections of this

tutorial.

The above program can also be written as a text file in an editor such as Notepad and then executed as a script.

4. Control Statements
This section explains how to structure the flow of control through a PL/SQL program. The

control structures of PL/SQL are simple yet powerful. Control structures in PL/SQL can be

divided into:

Conditional,

Iterative and

Sequential.


4.1 Conditional Control (Selection): This structure tests a condition; depending on whether the condition is true or false, it decides the sequence of statements to be executed.

Syntax for IF-THEN:
IF <condition> THEN
   Statements
END IF;


Syntax for IF-THEN-ELSE:

IF <condition> THEN
   Statements
ELSE
   Statements
END IF;



Syntax for IF-THEN-ELSIF:

IF <condition> THEN
   Statements
ELSIF <condition> THEN
   Statements
ELSE
   Statements
END IF;
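A small illustrative block using IF-THEN-ELSIF (the variable name and cut-off values are assumptions made for the example):

DECLARE
   marks NUMBER := 72;
BEGIN
   IF marks >= 60 THEN
      DBMS_OUTPUT.PUT_LINE ('First division');
   ELSIF marks >= 45 THEN
      DBMS_OUTPUT.PUT_LINE ('Second division');
   ELSE
      DBMS_OUTPUT.PUT_LINE ('Fail');
   END IF;
END;
/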


4.2 Iterative Control

LOOP statement executes the body statements multiple times. The statements are placed

between LOOP – END LOOP keywords. The simplest form of LOOP statement is an infinite

loop. EXIT statement is used inside LOOP to terminate it.

Syntax for LOOP- END LOOP

LOOP

Statements

END LOOP;

Example:

BEGIN

LOOP

DBMS_OUTPUT.PUT_LINE (‘Hello’);

END LOOP;

END;
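The block above loops forever; in practice an EXIT or EXIT WHEN statement terminates the loop. A sketch using a counter (the variable name and the limit are illustrative):

DECLARE
   counter NUMBER := 0;
BEGIN
   LOOP
      counter := counter + 1;
      DBMS_OUTPUT.PUT_LINE ('Hello ' || counter);
      EXIT WHEN counter >= 5;   -- leave the loop after five iterations
   END LOOP;
END;
/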


5. CURSOR

For every SQL statement execution certain area in memory is allocated. PL/SQL allows you to

name this area. This private SQL area is called context area or cursor. A cursor acts as a handle

or pointer into the context area. A PL/SQL program controls the context area using the cursor.

Cursor represents a structure in memory and is different from cursor variable. When you declare

a cursor, you get a pointer variable, which does not point to anything. When the cursor is opened,

memory is allocated and the cursor structure is created. The cursor variable now points the

cursor. When the cursor is closed the memory allocated for the cursor is released.

Cursors allow the programmer to retrieve data from a table and perform actions on that data one

row at a time. There are two types of cursors implicit cursors and explicit cursors.

5.1 Implicit cursors

For SQL queries returning a single row, PL/SQL declares implicit cursors. Implicit cursors are

simple SELECT statements and are written in the BEGIN block (executable section) of the

PL/SQL. Implicit cursors are easy to code, and they retrieve exactly one row. PL/SQL implicitly

declares cursors for all DML statements.

The most commonly raised exceptions here are NO_DATA_FOUND or TOO_MANY_ROWS.


Syntax:

SELECT Ename, sal INTO ena, esa FROM EMP WHERE EMPNO = 7845;

Note: Ename and sal are columns of the table EMP and ena and esa are the variables

used to store ename and sal fetched by the query.

5.2 Explicit Cursors

Explicit cursors are used in queries that return multiple rows. The set of rows fetched by a query

is called the active set. The size of the active set is the number of rows meeting the search criteria in the select statement.

Explicit cursor is declared in the DECLARE section of PL/SQL program.

Syntax:

CURSOR <cursor-name> IS <select statement>

Sample Code:


DECLARE

CURSOR emp_cur IS SELECT ename FROM EMP;

BEGIN

----

---

END;

Processing multiple rows is similar to file processing. For processing a file you need to open it,

process records and then close. Similarly user-defined explicit cursor needs to be opened, before

reading the rows, after which it is closed. Just as a file pointer marks the current position in file processing, the cursor marks the current position in the active set.

5.3 Opening Cursor

Syntax: OPEN <cursor-name>;

Example:

OPEN emp_cur;

When a cursor is opened the active set is determined, the rows satisfying the where clause in the

select statement are added to the active set. A pointer is established and points to the first row in

the active set.

5.4 Fetching from the cursor: To get the next row from the cursor we need to use fetch

statement.

Syntax: FETCH <cursor-name> INTO <variables>;

Example: FETCH emp_cur INTO ena;

The FETCH statement retrieves one row at a time. The BULK COLLECT clause needs to be used to fetch more than one row at a time.

Closing the cursor: After retrieving all the rows from the active set, the cursor should be closed. The resources allocated for the cursor are then freed. Once the cursor is closed, executing a FETCH statement will lead to errors.

Syntax: CLOSE <cursor-name>;
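Putting the steps together, a sketch that opens the cursor, fetches two rows with separate FETCH statements, and closes it (assuming the EMP table):

DECLARE
   CURSOR emp_cur IS SELECT ename FROM EMP;
   ena EMP.ename%TYPE;
BEGIN
   OPEN emp_cur;
   FETCH emp_cur INTO ena;   -- first row
   DBMS_OUTPUT.PUT_LINE (ena);
   FETCH emp_cur INTO ena;   -- second row
   DBMS_OUTPUT.PUT_LINE (ena);
   CLOSE emp_cur;
END;
/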


5.5 Explicit Cursor Attributes

Every cursor defined by the user has four attributes. When appended to the cursor name, these attributes let the user access useful information about the execution of a multi-row query.

The attributes are:

1. %NOTFOUND: A Boolean attribute, which evaluates to true if the last fetch failed, i.e. when there are no rows left in the cursor to fetch.
2. %FOUND: A Boolean attribute, which evaluates to true if the last fetch succeeded.
3. %ROWCOUNT: A numeric attribute, which returns the number of rows fetched by the cursor so far.
4. %ISOPEN: A Boolean attribute, which evaluates to true if the cursor is open, and false otherwise.

In the example above, a separate FETCH was written for each row; instead, a LOOP statement could be used. The following example explains the usage of LOOP.
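A sketch of such a loop, using %NOTFOUND to detect the end of the active set (assuming the EMP table):

DECLARE
   CURSOR emp_cur IS SELECT ename FROM EMP;
   ena EMP.ename%TYPE;
BEGIN
   OPEN emp_cur;
   LOOP
      FETCH emp_cur INTO ena;
      EXIT WHEN emp_cur%NOTFOUND;   -- true when the last fetch failed
      DBMS_OUTPUT.PUT_LINE (ena);
   END LOOP;
   DBMS_OUTPUT.PUT_LINE (emp_cur%ROWCOUNT || ' rows fetched');
   CLOSE emp_cur;
END;
/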


6. Exceptions

An exception is an error situation which arises during program execution. When an error occurs, an exception is raised, normal execution is stopped, and control transfers to the exception-handling part. Exception handlers are routines written to handle exceptions. Exceptions can be internally defined (system-defined or pre-defined) or user-defined.

6.1 Predefined exceptions are raised automatically whenever there is a violation of Oracle coding rules. An example is ZERO_DIVIDE, which is raised automatically when we try to divide a number by zero. Other built-in exceptions are given below. You can handle unexpected Oracle errors using the OTHERS handler, which can handle all raised exceptions that are not handled by any other handler. It must always be written as the last handler in the exception block.

CURSOR_ALREADY_OPEN – Raised when we try to open an already open cursor.
DUP_VAL_ON_INDEX – Raised when we try to insert a duplicate value into a unique column.
INVALID_CURSOR – Raised when we try to access an invalid cursor.
INVALID_NUMBER – Raised when something other than a number is used where a number value is expected.
LOGIN_DENIED – Raised when a user login is denied.
TOO_MANY_ROWS – Raised when a SELECT query returns more than one row and the destination variable can take only a single value.
VALUE_ERROR – Raised when an arithmetic, value conversion, truncation, or constraint error occurs.

Predefined exception handlers are declared globally in the package STANDARD. Hence we need not define them; we just use them.

The biggest advantage of exception handling is that it improves the readability and reliability of the code. Errors from many statements of code can be handled with a single handler. Instead of checking for an error at every point, we can just add an exception handler, and if any exception is raised it is handled by that handler.

To check for errors at a specific spot, it is better to place those statements in a separate BEGIN-END block.

Example 1: The following example gives the usage of the ZERO_DIVIDE exception.
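A minimal sketch (the variable values are illustrative):

DECLARE
   numerator   NUMBER := 10;
   denominator NUMBER := 0;
   result      NUMBER;
BEGIN
   result := numerator / denominator;   -- raises ZERO_DIVIDE
EXCEPTION
   WHEN ZERO_DIVIDE THEN
      DBMS_OUTPUT.PUT_LINE ('Attempt to divide by zero');
END;
/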


Example 2: The following example explains the usage of the NO_DATA_FOUND exception.
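A minimal sketch, assuming the EMP table and that no employee has the number 9999:

DECLARE
   ena EMP.ename%TYPE;
BEGIN
   SELECT ename INTO ena FROM EMP WHERE empno = 9999;
   DBMS_OUTPUT.PUT_LINE (ena);
EXCEPTION
   WHEN NO_DATA_FOUND THEN
      DBMS_OUTPUT.PUT_LINE ('No employee with this number');
END;
/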

The DUP_VAL_ON_INDEX exception is raised when a SQL statement tries to create a duplicate value in a column on which a primary key or unique constraint is defined.

Example 3: To demonstrate the exception DUP_VAL_ON_INDEX.
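A minimal sketch, assuming EMP.empno carries a primary key and an employee with number 7839 already exists:

BEGIN
   INSERT INTO EMP (empno, ename) VALUES (7839, 'KING');   -- duplicate key
EXCEPTION
   WHEN DUP_VAL_ON_INDEX THEN
      DBMS_OUTPUT.PUT_LINE ('Employee number already exists');
END;
/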

More than one Exception can be written in a single handler as shown below.


EXCEPTION

WHEN NO_DATA_FOUND OR TOO_MANY_ROWS THEN

Statements;

END;

6.2 User-defined Exceptions

A user-defined exception has to be defined by the programmer. User-defined exceptions are

declared in the declaration section with their type as exception. They must be raised explicitly

using RAISE statement, unlike pre-defined exceptions that are raised implicitly. RAISE

statement can also be used to raise internal exceptions.

Declaring the Exception:

DECLARE

myexception EXCEPTION;

BEGIN

------

Raising Exception:

BEGIN

RAISE myexception;

-------

Handling the Exception:

BEGIN

------

----

EXCEPTION

WHEN myexception THEN

Statements;

END;

Points To Ponder:

An exception cannot be declared twice in the same block.
Exceptions declared in a block are considered local to that block and global to its sub-blocks.
An enclosing block cannot access exceptions declared in its sub-blocks, whereas it is possible for a sub-block to refer to the exceptions of its enclosing block.

The following example explains the usage of User-defined Exception
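A minimal sketch (the exception name and the balance check are illustrative):

DECLARE
   myexception EXCEPTION;
   bal NUMBER := -50;
BEGIN
   IF bal < 0 THEN
      RAISE myexception;   -- user-defined exceptions must be raised explicitly
   END IF;
EXCEPTION
   WHEN myexception THEN
      DBMS_OUTPUT.PUT_LINE ('Balance cannot be negative');
END;
/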


RAISE_APPLICATION_ERROR

To display your own error messages, one can use the built-in RAISE_APPLICATION_ERROR. It displays the error message in the same way as Oracle errors. You should use a negative number between -20000 and -20999 for the error_number, and the error message should not exceed 512 characters. The syntax to call RAISE_APPLICATION_ERROR is

RAISE_APPLICATION_ERROR (error_number, error_message, { TRUE | FALSE });
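A minimal sketch (the error number and message are illustrative):

DECLARE
   bal NUMBER := -50;
BEGIN
   IF bal < 0 THEN
      RAISE_APPLICATION_ERROR (-20001, 'Balance cannot be negative');
   END IF;
END;
/

When this block runs, the caller receives ORA-20001 with the given message, just as with a built-in Oracle error.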


Using Cursor For Loop:

The cursor FOR loop can be used to process multiple records. There are two benefits with the cursor FOR loop, as the sketch after this list shows:

1. It implicitly declares a %ROWTYPE variable and uses it as the LOOP index.
2. The cursor FOR loop itself opens the cursor, reads the records and then closes the cursor automatically. Hence OPEN, FETCH and CLOSE statements are not necessary in it.
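A sketch of such a loop, using the names emp_cur and emp_rec that the commentary below refers to (assuming the EMP table):

DECLARE
   CURSOR emp_cur IS SELECT ename, sal FROM EMP;
BEGIN
   FOR emp_rec IN emp_cur LOOP   -- emp_rec is implicitly declared as emp_cur%ROWTYPE
      DBMS_OUTPUT.PUT_LINE (emp_rec.ename || ' ' || emp_rec.sal);
   END LOOP;                     -- the cursor is closed automatically here
END;
/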


emp_rec is an automatically created variable of %ROWTYPE. We have not used OPEN, FETCH and CLOSE in the above example, as the cursor FOR loop does this automatically. The above example can be rewritten with fewer lines of code, as shown below; this form is called an implicit FOR loop.
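A sketch of the implicit form, where the SELECT is written directly in the loop and no cursor is declared:

BEGIN
   FOR emp_rec IN (SELECT ename, sal FROM EMP) LOOP
      DBMS_OUTPUT.PUT_LINE (emp_rec.ename || ' ' || emp_rec.sal);
   END LOOP;
END;
/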


Deletion or Updation Using Cursor:

All the previous examples explained how to retrieve data using cursors. Now we will see how to modify or delete rows in a table using cursors. In order to update or delete rows, the cursor must be defined with the FOR UPDATE clause, and the UPDATE or DELETE statement must use the WHERE CURRENT OF <cursor-name> clause.

The following example updates the comm of all employees with salary less than 2000 by adding 100 to the existing comm.
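A sketch of such an update (assuming the EMP table; NVL is used here because comm may be null):

DECLARE
   CURSOR emp_cur IS
      SELECT comm FROM EMP WHERE sal < 2000 FOR UPDATE;
BEGIN
   FOR emp_rec IN emp_cur LOOP
      UPDATE EMP
         SET comm = NVL (comm, 0) + 100
       WHERE CURRENT OF emp_cur;   -- updates the row just fetched
   END LOOP;
   COMMIT;
END;
/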

7. PL/SQL subprograms

A subprogram is a named block of PL/SQL. There are two types of subprograms in PL/SQL, namely procedures and functions. Every subprogram has a declarative part, an executable part or body, and an optional exception-handling part. The declarative part contains variable declarations, the body of a subprogram contains executable SQL and PL/SQL statements, and statements to handle exceptions are written in the exception part.


When a client executes a procedure or function, the processing is done on the server. This reduces network traffic. Subprograms are compiled and stored in the Oracle database as stored programs and can be invoked whenever required. As they are stored in compiled form, they only need to be executed when called; hence they save the time needed for compilation.

Subprograms provide the following advantages:

1. They allow you to write PL/SQL programs that meet our needs.
2. They allow you to break a program into manageable modules.
3. They provide reusability and maintainability for the code.

7.1 Procedures

A procedure is a subprogram used to perform a specific action. A procedure contains two parts: the specification and the body. The procedure specification begins with CREATE and ends with the procedure name or parameter list. Procedures that do not take parameters are written without parentheses. The body of the procedure starts after the keyword IS or AS and ends with the keyword END.
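A sketch of the general form described above (angular brackets mark user-defined parts, square brackets optional parts):

CREATE [OR REPLACE] PROCEDURE <procedure-name> [(<parameter-list>)]
[AUTHID DEFINER | CURRENT_USER]
IS | AS
   [<local declarations>]
BEGIN
   <executable statements>
[EXCEPTION
   <exception handlers>]
END;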


In the above syntax, things enclosed in angular brackets ("< >") are user-defined, and those enclosed in square brackets ("[ ]") are optional.

OR REPLACE is used to overwrite an existing procedure with the same name, if there is any.

The AUTHID clause is used to decide whether the procedure should execute with invoker (current user, the person who executes it) or definer (owner, the person who created it) rights.

Example

CREATE PROCEDURE MyProc

(ENO NUMBER)

AUTHID DEFINER AS

BEGIN

DELETE FROM EMP

WHERE EMPNO= ENO;

EXCEPTION

WHEN NO_DATA_FOUND THEN

DBMS_OUTPUT.PUT_LINE ('No employee with this number');

END;

Let us assume that the above procedure is created in the SCOTT schema (SCOTT's user area) and is executed by the user SEENU. It will delete rows from the table EMP owned by SCOTT, but not from the EMP owned by SEENU. It is possible to use a procedure owned by one user on tables owned by other users; this is done by setting invoker rights:

AUTHID CURRENT_USER

PRAGMA AUTONOMOUS_TRANSACTION is used to instruct the compiler to treat the procedure as autonomous, i.e. to commit or roll back the changes made by the procedure independently of the calling transaction.

Parameter Modes


Parameters are used to pass values to the procedure being called. There are three modes for parameters, based on their usage: IN, OUT, and IN OUT. An IN mode parameter is used to pass values to the called procedure; inside the program, an IN parameter acts like a constant, i.e. it cannot be modified. An OUT mode parameter allows you to return a value from the procedure; inside the procedure, the OUT parameter acts like an uninitialized variable, and therefore its value cannot be assigned to another variable. An IN OUT mode parameter allows you to both pass values to and return values from the subprogram. The default mode of an argument is IN.

POSITIONAL vs. NOTATIONAL parameters

Parameters can be passed to a procedure using either positional notation or named notation.

Example:

If a procedure is defined as GROSS (ESAL NUMBER, ECOM NUMBER) and we call this procedure as GROSS (ESA, ECO), then the parameters used are called positional parameters. For named (notational) parameters we use the following syntax:

GROSS (ECOM => ECO, ESAL => ESA)

A procedure can also be executed by invoking it as an executable statement as shown below.

BEGIN

PROC1; --- PROC1 is name of the procedure.


END;

/

Functions:

A function is a PL/SQL subprogram which is used to compute a value. A function is the same as a procedure, except that it has a RETURN clause.

Syntax for Function
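A sketch of the general form (angular brackets mark user-defined parts, square brackets optional parts):

CREATE [OR REPLACE] FUNCTION <function-name> [(<parameter-list>)]
RETURN <datatype>
IS | AS
   [<local declarations>]
BEGIN
   <executable statements>
   RETURN <expression>;
END;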

Examples: a function without arguments, a function with arguments, and different ways of executing the function are sketched below.
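The following sketches are illustrative; the function names, the EMP table and the argument values are assumptions.

Function without arguments:

CREATE OR REPLACE FUNCTION emp_count
RETURN NUMBER
AS
   cnt NUMBER;
BEGIN
   SELECT COUNT (*) INTO cnt FROM EMP;
   RETURN cnt;
END;
/

Function with arguments:

CREATE OR REPLACE FUNCTION gross_sal (esal NUMBER, ecom NUMBER)
RETURN NUMBER
AS
BEGIN
   RETURN esal + NVL (ecom, 0);
END;
/

Different ways of executing the function:

-- from a PL/SQL block
BEGIN
   DBMS_OUTPUT.PUT_LINE (gross_sal (2000, 300));
END;
/

-- from a SQL statement
SELECT ename, gross_sal (sal, comm) FROM EMP;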


Chapter-5 PROCEDURAL QUERY LANGUAGE

End Chapter quizzes

Q1. Select the correct statement
a) User-defined exceptions are defined by the programmer
b) PL/SQL improves the capacity of SQL
c) %NOTFOUND is a Boolean attribute
d) All of the above

Q2) Select the correct statement

a) Declaration section is optional.

b) The Execution section is mandatory

c) The Exception (or Error) Handling section is mandatory.

d) Only a and b are correct.

Q3. A command used to access results from Oracle Server

a) SET SERVEROUTPUT ON

b) PRINT

c) WRITE

d) OUTPUT_SERVER

Q4. Which cursors are used in queries that return multiple rows?

a) Explicit cursor

b) Implicit cursors

c) Open Cursor

d) Both a and c

Q5. Program logic of PL/SQL is written in:

a) Declaration section

b) Execution Section

c) Exception Handling

d) Program Section.

Q6 Variable and Constants are declared in

a) Variable Section

b) Declaration Section

c) Execution Section

d) Program Section

Q7. There are two types of subprograms in PL/SQL namely

a) Procedures

b) Cursor

c) Functions
d) Both a and c


Q8. User-defined exception has to be defined by

a) Programmer

b) User

c) Technical Writer

d) None

Q9. Biggest advantage of exception handling is it improves

a) Readability

b) Reliability

c) Both a and b

d) None

Q10. NO_DATA_FOUND and TOO_MANY_ROWS are
a) most commonly used functions
b) most commonly raised exceptions

c) Triggers

d) Procedures


Chapter: 6

TRANSACTION MANAGEMENT & CONCURRENCY CONTROL TECHNIQUE

1. Introductory Concept to Database Transaction

A database transaction comprises a logical unit of work performed within a database management system (or similar system) against a database, and is treated in a coherent and reliable way independent of other transactions.

Transactions in a database environment have two main purposes:

1. To provide reliable units of work that allow correct recovery from failures and keep a database

consistent even in cases of system failure, when execution stops (completely or partially) and

many operations upon a database remain uncompleted, with unclear status.

2. To provide isolation between programs accessing a database concurrently. Without isolation

the programs' outcomes are possibly erroneous.

A database transaction, by definition, must be atomic, consistent, isolated and durable.

Database practitioners often refer to these properties of database transactions using the acronym

ACID.

Transactions provide an "all-or-nothing" proposition, stating that each work-unit performed in a

database must either complete in its entirety or have no effect whatsoever. Further, the system

must isolate each transaction from other transactions, results must conform to existing

constraints in the database, and transactions that complete successfully must get written to

durable storage.

Most modern relational database management systems fall into the category of databases that

support transactions: transactional databases.

In a database system a transaction might consist of one or more data-manipulation statements

and queries, each reading and/or writing information in the database. Users of database systems

consider consistency and integrity of data as highly important. A simple transaction is usually

issued to the database system in a language like SQL wrapped in a transaction, using a pattern

similar to the following:


1. Begin the transaction

2. Execute several data manipulations and queries

3. If no errors occur then commit the transaction and end it

4. If errors occur then rollback the transaction and end it
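A minimal PL/SQL sketch of this pattern (the account table, its acc_no and balance columns, and the amount are illustrative assumptions):

BEGIN
   -- the transaction begins implicitly with the first data manipulation
   UPDATE account SET balance = balance - 100 WHERE acc_no = 'A';
   UPDATE account SET balance = balance + 100 WHERE acc_no = 'B';
   COMMIT;            -- no errors: make the changes permanent
EXCEPTION
   WHEN OTHERS THEN
      ROLLBACK;       -- an error occurred: undo all changes
      RAISE;
END;
/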

If no errors occurred during the execution of the transaction then the system commits the

transaction. A transaction commit operation applies all data manipulations within the scope of

the transaction and persists the results to the database. If an error occurs during the transaction,

or if the user specifies a rollback operation, the data manipulations within the transaction are not

persisted to the database. In no case can a partial transaction be committed to the database since

that would leave the database in an inconsistent state.

Internally, multi-user databases store and process transactions, often by using a transaction ID or

XID.

2. ACID properties

When a transaction processing system creates a transaction, it will ensure that the transaction

will have certain characteristics. The developers of the components that comprise the transaction

are assured that these characteristics are in place. They do not need to manage these

characteristics themselves. These characteristics are known as the ACID properties. ACID is an

acronym for atomicity, consistency, isolation, and durability.

2.1 Atomicity

The atomicity property identifies that the transaction is atomic. An atomic transaction is either fully completed, or is not begun at all. Any updates that a transaction makes to a system are completed in their entirety. If for any reason an error occurs and the transaction is unable to complete all of its steps, the system is returned to the state it was in before the transaction was started. An example of an atomic transaction is an account transfer transaction. The money is removed from account A and then placed into account B. If the system fails after removing the money from account A, then the transaction processing system will put the money back into account A, thus returning the system to its original state. This is known as a rollback, as we said at the beginning of this chapter.


2.2 Consistency

A transaction enforces consistency in the system state by ensuring that at the end of any

transaction the system is in a valid state. If the transaction completes successfully, then all

changes to the system will have been properly made, and the system will be in a valid state. If

any error occurs in a transaction, then any changes already made will be automatically rolled

back. This will return the system to its state before the transaction was started. Since the system

was in a consistent state when the transaction was started, it will once again be in a consistent

state.

Looking again at the account transfer system, the system is consistent if the total of all accounts

is constant. If an error occurs and the money is removed from account A and not added to

account B, then the total in all accounts would have changed. The system would no longer be

consistent. By rolling back the removal from account A, the total will again be what it should be,

and the system back in a consistent state.

2.3 Isolation

When a transaction runs in isolation, it appears to be the only action that the system is carrying

out at one time. If there are two transactions that are both performing the same function and are

running at the same time, transaction isolation will ensure that each transaction thinks it has

exclusive use of the system. This is important in that as the transaction is being executed, the

state of the system may not be consistent. The transaction ensures that the system remains

consistent after the transaction ends, but during an individual transaction, this may not be the

case. If a transaction was not running in isolation, it could access data from the system that may

not be consistent. By providing transaction isolation, this is prevented from happening.

2.4 Durability

A transaction is durable in that once it has been successfully completed, all of the changes it

made to the system are permanent. There are safeguards that will prevent the loss of information,

even in the case of system failure. By logging the steps that the transaction performs, the state of

the system can be recreated even if the hardware itself has failed. The concept of durability

allows the developer to know that a completed transaction is a permanent part of the system,

regardless of what happens to the system later on.


3 The Concept of Schedules

When transactions are executing concurrently in an interleaved fashion, not only do the actions of each transaction become important, but also the order of execution of operations from each of these transactions. Hence, for analyzing any problem, it is not just the history of previous transactions that one should worry about, but also the "schedule" of operations.

3.1 Schedule (History of transaction):

We formally define a schedule S of n transactions T1, T2 ... Tn as an ordering of the operations of the transactions, subject to the constraint that, for each transaction Ti that participates in S, the operations of Ti must appear in the same order in which they appear in Ti. That is, if two operations Ti1 and Ti2 are listed in Ti such that Ti1 is earlier than Ti2, then in the schedule also Ti1 should appear before Ti2. However, if Ti2 appears immediately after Ti1 in Ti, the same may not be true in S, because some other operation Tj1 (of a transaction Tj) may be interleaved between them. In short, a schedule lists the sequence of operations on the database in the same order in which they were effected in the first place.

For the recovery and concurrency control operations, we concentrate mainly on read and write

operations of the transactions, because these operations actually effect changes to the database.

The other two (equally) important operations are commit and abort, since they decide when the

changes effected have actually become active on the database.

Since listing each of these operations becomes a lengthy process, we adopt a notation for describing the schedule. The read operations (readtr), write operations (writetr), commit and abort are indicated by r, w, c and a respectively, and each of them comes with a subscript indicating the transaction number.

For example, SA: r1(x); r2(y); w2(y); r1(y); w1(x); a1

indicates the following operations in the same order:

readtr(x) of transaction 1
readtr(y) of transaction 2
writetr(y) of transaction 2
readtr(y) of transaction 1
writetr(x) of transaction 1
abort of transaction 1


3.2 Conflicting operations: Two operations in a schedule are said to be in conflict if they satisfy

these conditions

i) The operations belong to different transactions

ii) They access the same item x

iii) At least one of the operations is a write operation.

For example, the pairs

r1(x); w2(x)
w1(x); r2(x)
w1(y); w2(y)

conflict, because in each pair both operations access the same item and at least one of them is a write operation. But r1(x); w2(y) and r1(x); r2(x) do not conflict, because in the first case the read and write are on different data items, and in the second case both are trying to read the same data item, which they can do without any conflict.

3.3 A Complete Schedule: A schedule S of n transactions T1, T2 ... Tn is said to be a "complete schedule" if the following conditions are satisfied:

i) The operations listed in S are exactly the same operations as in T1, T2 ... Tn, including the commit or abort operations. Each transaction is terminated by either a commit or an abort operation.
ii) The operations of any transaction Ti appear in the schedule in the same order in which they appear in the transaction.
iii) Whenever there are conflicting operations, one of the two will occur before the other in the schedule.

A "partial order" of the schedule is said to occur if the first two conditions of the complete schedule are satisfied, but whenever there are non-conflicting operations in the schedule, they can occur without indicating which should appear first. This can happen because non-conflicting operations can anyway be executed in any order without affecting the actual outcome.

However, in a practical situation, it is very difficult to come across complete schedules. This is because new transactions keep getting included into the schedule. Hence, one often works with a


"committed projection" C(S) of a schedule S. This set includes only those operations in S that belong to committed transactions, i.e. transactions Ti whose commit operation ci is in S.

Put in simpler terms, since uncommitted operations do not get reflected in the actual outcome of the schedule, only those transactions that have completed their commit operations contribute to the set, and this schedule is good enough in most cases.

3.4 Schedules and Recoverability:

Recoverability is the ability to recover from transaction failures. The success or otherwise of recovery depends on the schedule of transactions. If fairly straightforward operations without much interleaving of transactions are involved, error recovery is a straightforward process. On the other hand, if a lot of interleaving of different transactions has taken place, then recovering from the failure of any one of these transactions could be an involved affair. In certain cases, it may not be possible to recover at all. Thus, it is desirable to characterize schedules based on their recovery capabilities.

To do this, we observe certain features of recoverability and of schedules. To begin with, we note that any recovery process most often involves a "roll back" operation, wherein the operations of the failed transaction are undone. However, we also note that the roll back needs to go back only as long as the transaction T has not committed. If the transaction T has committed once, it need not be rolled back. The schedules that satisfy this criterion are called "recoverable schedules", and those that do not are called "non-recoverable schedules". As a rule, such non-recoverable schedules should not be permitted.

Formally, a schedule S is recoverable if no transaction T which appears in S commits until all transactions T' that have written an item which is read by T have committed. The concept is a simple one. Suppose the transaction T reads an item X from the database, completes its operations (based on this and other values) and commits the values, i.e. the output values of T become permanent values of the database.

But suppose this value X was written by another transaction T' (before it was read by T), which aborts after T has committed. What happens? The values committed by T are no longer valid, because the basis of these values (namely X) has itself been changed. Obviously T also needs to be rolled back (if possible), leading to other rollbacks and so on.

The other aspect to note is that in a recoverable schedule, no committed transaction needs to be

rolled back. But, it is possible that a cascading roll back scheme may have to be effected, in

which an uncommitted transaction has to be rolled back, because it read from a value contributed

by a transaction which later aborted. But such cascading rollbacks can be very time consuming


because at any instant of time, a large number of uncommitted transactions may be operating.

Thus, it is desirable to have “cascadeless” schedules, which avoid cascading rollbacks.

This can be achieved by ensuring that transactions read only those values which were written by committed transactions, so that there is no fear of any aborted or failed transactions later on. If the schedule has a sequence wherein a transaction T1 has to read a value X written by an uncommitted transaction T2, then the sequence is altered so that the reading is postponed till T2 either commits or aborts.

This delays T1, but avoids any possibility of cascading rollbacks.

The third type of schedule is a "strict schedule", which, as the name suggests, is highly restrictive in nature. Here, transactions are allowed neither to read nor to write a value X until the last transaction that wrote X has committed or aborted. Note that the strict schedule largely simplifies the recovery process, but in many cases it may not be possible to devise strict schedules.

It may be noted that recoverable schedules, cascadeless schedules and strict schedules are each more stringent than their predecessor. Each facilitates the recovery process, but sometimes the process may get delayed, or it may even become impossible to schedule.


4 Serializability

Given two transactions T1 and T2 to be scheduled, they can be scheduled in a number of ways. The simplest way is to schedule them without bothering about interleaving them, i.e. schedule all operations of transaction T1 followed by all operations of T2, or alternatively schedule all operations of T2 followed by all operations of T1.

T1                      T2
read_tr(X)
X = X + N
write_tr(X)
read_tr(Y)
Y = Y + N
write_tr(Y)
                        read_tr(X)
                        X = X + P
                        write_tr(X)

Non-interleaved (serial) schedule A. (Time runs downward.)

T1                      T2
                        read_tr(X)
                        X = X + P
                        write_tr(X)
read_tr(X)
X = X + N
write_tr(X)
read_tr(Y)
Y = Y + N
write_tr(Y)

Non-interleaved (serial) schedule B.


These can now be termed serial schedules, since the entire sequence of operations in one transaction is completed before the next transaction is started.

In the interleaved mode, the operations of T1 are mixed with the operations of T2. This can be done in a number of ways. Two such sequences are given below:

T1                      T2
read_tr(X)
X = X + N
                        read_tr(X)
                        X = X + P
write_tr(X)
read_tr(Y)
                        write_tr(X)
Y = Y + N
write_tr(Y)

Interleaved (non-serial) schedule C.

T1                      T2
read_tr(X)
X = X + N
write_tr(X)
                        read_tr(X)
                        X = X + P
                        write_tr(X)
read_tr(Y)
Y = Y + N
write_tr(Y)

Interleaved (non-serial) schedule D.


Formally, a schedule S is serial if, for every transaction T in the schedule, all operations of T are executed consecutively; otherwise it is called non-serial. In such a non-interleaved schedule, if the transactions are independent, one can presume that the schedule will be correct, since each transaction commits or aborts before the next transaction begins. As long as the transactions individually are error free, such sequences of events are guaranteed to give correct results.

The problem with such a situation is the wastage of resources. If, in a serial schedule, one of the transactions is waiting for an I/O, the other transactions also cannot use the system resources, and hence the entire arrangement is wasteful of resources. If some transaction T is very long, the other transactions will have to keep waiting till it is completed. Moreover, in systems where hundreds of users operate concurrently, serial scheduling becomes unthinkable. Hence, in general, the serial scheduling concept is unacceptable in practice.

However, once the operations are interleaved so that the above cited problems are overcome, then unless the interleaving sequence is well thought out, all the problems that we encountered at the beginning of this block can reappear. Hence, a methodology is to be adopted to find out which of the interleaved schedules give correct results and which do not.

A schedule S of n transactions is "serializable" if it is equivalent to some serial schedule of the same n transactions. Note that there are n! different serial schedules possible out of n transactions, and if one goes about interleaving them, the number of possible combinations becomes unmanageably high. To ease our operations, we form two disjoint groups of non-serial schedules: those non-serial schedules that are equivalent to one or more serial schedules, which we call "serializable schedules", and those that are not equivalent to any serial schedule and hence are not serializable. Once a non-serial schedule is serializable, it becomes equivalent to a serial schedule and, by our previous definition of a serial schedule, becomes a "correct" schedule. But how can one prove the equivalence of a non-serial schedule to a serial schedule?

The simplest and most obvious method to conclude that two such schedules are equivalent is to compare their results. If they produce the same results, they can be considered equivalent; i.e. if two schedules are "result equivalent", then they can be considered equivalent. But such an oversimplification is full of problems. Two sequences may produce the same set of results for one or even a large number of initial values, but still may not be equivalent.

Consider the following two sequences:

S1                      S2
read_tr(X)              read_tr(X)
X = X + X               X = X * X
write_tr(X)             write_tr(X)

For a value X = 2, both produce the same result. Can we conclude that they are equivalent?

Though this may look like a simplistic example, with some imagination one can always come up with more sophisticated examples wherein the "bugs" of treating them as equivalent are less obvious. But the concept still holds: result equivalence cannot mean schedule equivalence. A more refined method of finding equivalence is available, called "conflict equivalence". Two schedules are said to be conflict equivalent if the order of any two conflicting operations is the same in both schedules. (Note that conflicting operations essentially belong to two different transactions, access the same data item, and at least one of them is a write_tr(X) operation.) If two such conflicting operations appear in different orders in different schedules, then it is obvious that they produce two different databases in the end, and hence they are not equivalent.

4.1 Testing for conflict serializability of a schedule:

We suggest an algorithm that tests a schedule S for conflict serializability:

1. For each transaction Ti participating in the schedule S, create a node labeled Ti in the precedence graph.
2. For each case in S where Tj executes a readtr(X) after Ti executes a writetr(X), create an edge from Ti to Tj in the precedence graph.
3. For each case in S where Tj executes a writetr(X) after Ti executes a readtr(X), create an edge from Ti to Tj in the graph.
4. For each case in S where Tj executes a writetr(X) after Ti executes a writetr(X), create an edge from Ti to Tj in the graph.
5. The schedule S is serializable if and only if there are no cycles in the graph.


If we apply these methods to draw the precedence graphs for the four schedules of section 4, we get the following. For schedule A, the graph has a single edge T1 -> T2 (labeled X); for schedule B, a single edge T2 -> T1; for schedule C, edges T1 -> T2 and T2 -> T1 on X, forming a cycle; and for schedule D, a single edge T1 -> T2.

We may conclude that schedule D is equivalent to schedule A, while schedule C is not serializable.

4.2. View equivalence and view serializability:

Apart from the conflict equivalence of schedules and conflict serializability, another

restrictive equivalence definition has been used with reasonable success in the context

of serializability. This is called view serializability.

Two schedules S and S1 are said to be "view equivalent" if the following conditions are satisfied:

i) The same set of transactions participates in S and S1, and S and S1 include the same operations of those transactions.


ii) For any operation ri(X) of Ti in S, if the value of X read by the operation was written by an operation wj(X) of Tj (or if it is the original value of X before the schedule started), the same condition must hold for the value of X read by the operation ri(X) of Ti in S1.
iii) If the operation wk(Y) of Tk is the last operation to write the item Y in S, then wk(Y) of Tk must also be the last operation to write the item Y in S1.

The concept behind view equivalence is that, as long as each read operation of a transaction reads the result of the same write operation in both schedules, the write operations of each transaction must produce the same results. Hence, the read operations are said to see the same view in both schedules. It can easily be verified that when S or S1 operate independently on a database with the same initial state, they produce the same end states. A schedule S is said to be view serializable if it is view equivalent to a serial schedule.

It can also be verified that the definitions of conflict serializability and view serializability are similar if a condition of "constrained write assumption" holds on all transactions of the schedules. This condition states that any write operation wi(X) in Ti is preceded by an ri(X) in Ti, and that the value written by wi(X) in Ti depends only on the value of X read by ri(X). It assumes that the computation of the new value of X is a function f(X) based on the old value of X read from the database. However, the definition of view serializability is less restrictive than that of conflict serializability under the "unconstrained write assumption", where the value written by the operation wi(X) in Ti can be independent of its old value in the database. This is called a "blind write".

But the main problem with view serializability is that it is extremely complex computationally, and there is no efficient algorithm for testing it.

4.3 Uses of serializability:

Proving the serializability of a schedule S is equivalent to saying that S is correct: it guarantees that the schedule provides correct results. But being serializable is not the same as being serial. A serial schedule is inefficient because of the reasons explained earlier, which leads to under-utilization of the CPU and I/O devices, and in some cases, like mass reservation systems, becomes untenable. On the other hand, a serializable schedule combines the benefits of concurrent execution (efficient system utilization, the ability to cater to a larger number of concurrent users) with the guarantee of correctness.


But all is not well yet. The scheduling process is done by operating system routines after taking into account various factors like system load, time of transaction submission, priority of the process with reference to other processes, and a large number of other factors. Also, since a very large number of interleaving combinations are possible, it is extremely difficult to determine beforehand the manner in which the transactions will be interleaved. In other words, getting the various schedules itself is difficult, let alone testing them for serializability.

Hence, instead of generating the schedules, checking them for serializability and then using them, most DBMS protocols use a more practical method: impose restrictions on the transactions themselves. These restrictions, when followed by every participating transaction, automatically ensure serializability in all schedules that are created by these participating transactions.

Also, since transactions are being submitted at different times, it is difficult to determine when a schedule begins and when it ends. Hence serializability theory can be used to deal with the problem by considering only the committed projection C(S) of the schedule. Hence, as an approximation, we can define a schedule S as serializable if its committed projection C(S) is equivalent to some serial schedule.

5. The need for concurrency control

Let us imagine a situation wherein a large number of users (probably spread over vast geographical areas) are operating on a concurrent system. Several problems can occur if they are allowed to execute their transaction operations in an uncontrolled manner.

Consider a simple example of a railway reservation system. Since a number of people are accessing the database simultaneously, it is obvious that multiple copies of the data are to be provided so that each user can go ahead with his operations. Let us make the concept a


little more specific. Suppose we are considering the number of reservations in a particular train on a particular date. Two persons at two different places are trying to reserve on this train. By the very definition of concurrency, each of them should be able to perform the operations irrespective of the fact that the other person is also doing the same. In fact, they will not even know that the other person is also booking for the same train. The only way of ensuring this is to make available to each of these users their own copies to operate upon, and finally to update the master database at the end of their operations.

Now suppose there are 10 seats available. Both persons, say A and B, want to get this information and book their seats. Since they are to be accommodated concurrently, the system provides them two copies of the data. The simple way is to perform a read_tr(X), so that the value of X is copied on to the variable X of person A (let us call it XA) and of person B (XB). So each of them knows that there are 10 seats available.

Suppose A wants to book 8 seats. Since the number of seats he wants (say Y) is less than the available seats, the program can allot him the seats, change the number of available seats (X) to X - Y, and can even give him the seat numbers that have been booked for him.

The problem is that a similar operation can be performed by B also. Suppose he needs 7 seats. So he gets his seven seats, replaces the value of X by 3 (10 - 7), and gets his reservation.

The problem is noticed only when these blocks are returned to the main database (the disk in the above case).

Before we can analyze these problems, we look at the problem from a more technical

view.

5.1 The lost update problem: This problem occurs when two transactions that access the same database items have their operations interleaved in such a way as to make the value of some database item incorrect. Suppose the transactions T1 and T2 are submitted at (approximately) the same time. Because of the concept of interleaving, each operation is executed for some period of time and then the control is passed on to the other transaction, and this sequence continues. Because of the delay in updating, this creates a problem. This is what happened in the previous example. Let the transactions be called TA and TB.


TA                      TB
read_tr(X)
                        read_tr(X)
X = X - NA
                        X = X - NB
write_tr(X)
                        write_tr(X)

(Time runs downward.)

Note that the problem occurred because transaction TB did not take into account the update made by TA: since TB wrote X later, its write overwrote the value written by TA, and the update of TA was lost. Had the writes occurred in the other order, the update of TB would have been lost instead.

5.2 Dirty read problem

This happens when a transaction TA updates a data item, but later on (for some reason) the transaction fails. This could be due to a system failure or any other operational reason, or the system may later notice that the operation should not have been done and cancel it. To be fair, it also ensures that the original value is restored.

But in the meanwhile, another transaction TB has accessed the data; since it has no indication of what happens later on, it makes use of this data and goes ahead. Once the original value is restored by TA, the values generated by TB are obviously invalid.

TA                      TB
read_tr(X)
X = X - N
write_tr(X)
                        read_tr(X)
                        X = X - N
                        write_tr(X)
Failure
X = X + N
write_tr(X)

(Time runs downward.)

The value generated by TA in a transaction that does not survive is "dirty data"; when it is read by TB, TB produces an invalid value. Hence the problem is called the dirty read problem.

5.3 The Incorrect Summary Problem: Consider two concurrent transactions, again called TA and TB. TB is calculating a summary (average, standard deviation or some such operation) by accessing all elements of a database. (Note that it is not updating any of them; it is only reading them and using the resultant data to calculate some values.) In the meanwhile, TA is updating these values. Since the operations are interleaved, TB, for some of its operations, will be using the non-updated data, whereas for the other operations it will be using the updated data. This is called the incorrect summary problem.

TA                      TB
                        Sum = 0
                        read_tr(A)
                        Sum = Sum + A
read_tr(X)
X = X - N
write_tr(X)
                        read_tr(X)
                        Sum = Sum + X
                        read_tr(Y)
                        Sum = Sum + Y
read_tr(Y)
Y = Y - N
write_tr(Y)

In the above example, TA updates both X and Y. But since it first updates X and then Y, and the operations are so interleaved that transaction TB uses both of them in between the operations, TB ends up using the old value of Y with the new value of X. In the process, the sum we get refers neither to the old set of values nor to the new set of values.

6 Locking techniques for concurrency control


Many of the important techniques for concurrency control make use of the concept of the lock. A lock is a variable associated with a data item that describes the status of the item with respect to the possible operations that can be done on it. Normally every data item is associated with a unique lock. Locks are used as a method of synchronizing the access of database items by the transactions that are operating concurrently. Such controls, when implemented properly, can overcome many of the problems of concurrent operations listed earlier. However, the locks themselves may create a few problems, which we shall see in some detail in subsequent sections.

6.1 Types of locks and their uses:

6.1.1 Binary locks: A binary lock can have two states or values (1 or 0); one of them indicates that it is locked and the other says it is unlocked. For example, if we presume 1 indicates that the lock is on and 0 indicates it is open, then if the lock of item X is 1, a read_tr(X) cannot access the item as long as the lock's value continues to be 1. We can refer to such a state as lock(X).

The concept works like this. The item X can be accessed only when it is free to be used by the transactions. If, say, its current value is being modified, then X cannot be (in fact should not be) accessed till the modification is complete. The simple mechanism is to lock access to X as long as the process of modification is on, and unlock it for use by the other transactions only when the modifications are complete.

So we need two operations: lock_item(X), which locks the item, and unlock_item(X), which opens the lock. Any transaction that wants to make use of the data item first checks the lock status of X by lock_item(X). If the item X is already locked (lock status = 1), the transaction will have to wait. Once the status becomes 0, the transaction accesses the item and locks it (makes its status 1). When the transaction has completed using the item, it issues an unlock_item(X) command, which again sets the status to 0, so that other transactions can access the item.

6.1.2 Shared and Exclusive locks

While the operation of the binary lock scheme appears satisfactory, it suffers from a serious drawback. Once a transaction holds a lock (has issued a lock operation), no other transaction can access the data item. But in large concurrent systems, this can become a disadvantage. It is obvious that more than one transaction should not go on writing into X, and that while one transaction is writing into it, no other transaction should be reading it; but no harm is done if several transactions are allowed to read the item simultaneously. This would save the time of all these transactions, without in any way affecting the performance.

This concept gave rise to the idea of shared/exclusive locks. When only read operations are being performed, the data item can be shared by several transactions; it is only when a transaction wants to write into the item that the lock should be exclusive. Hence the shared/exclusive lock is also sometimes called a multiple-mode lock. A read lock is a shared lock (which can be used by several transactions), whereas a write lock is an exclusive lock. So we need to think of three operations: a read lock, a write lock and unlock. The algorithms can be as follows:

read_lock(X):
start: if Lock(X) = "unlocked"
       then { Lock(X) ← "read locked";
              no_of_reads(X) ← 1;
            }
       else if Lock(X) = "read locked"
       then no_of_reads(X) ← no_of_reads(X) + 1;
       else { wait (until Lock(X) = "unlocked" and the lock manager wakes up the transaction);
              go to start;
            }
end.

The read lock operation.

write_lock(X):
start: if Lock(X) = "unlocked"
       then Lock(X) ← "write locked";
       else { wait (until Lock(X) = "unlocked" and the lock manager wakes up the transaction);
              go to start;
            }
end.

The write lock operation.

unlock(X):
if Lock(X) = "write locked"
then { Lock(X) ← "unlocked";
       wake up one of the waiting transactions, if any;
     }
else if Lock(X) = "read locked"
then { no_of_reads(X) ← no_of_reads(X) - 1;
       if no_of_reads(X) = 0
       then { Lock(X) ← "unlocked";
              wake up one of the waiting transactions, if any;
            }
     }

The unlock operation.

The algorithms are fairly straightforward, except that during the unlocking operation, if a number of read locks are there, all of them are to be released before the item itself becomes unlocked.

To ensure smooth operation of the shared/exclusive locking system, the system must enforce the following rules:

1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any read or write operations are performed on X.
2. A transaction T must issue the operation write_lock(X) before any writetr(X) operation is performed on it.
3. A transaction T must issue the operation unlock(X) after all readtr(X) and writetr(X) operations are completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read lock or write lock on X.
5. A transaction T will not issue a write_lock(X) operation if it already holds a read lock or write lock on X.

6.1.3 Two phase locking:

A transaction is said to follow two-phase locking if its operations can be divided into two distinct phases. In the first phase, all items needed by the transaction are acquired by locking them; in this phase, no item is unlocked, even if its operations on it are over. In the second phase, the items are unlocked one after the other. The first phase can be thought of as a growing phase, wherein the store of locks held by the transaction keeps growing. The second phase is called the shrinking phase, wherein the number of locks held by the transaction keeps shrinking.

readlock(Y)
readtr(Y)            Phase I
writelock(X)
-----------------------------------
unlock(Y)
readtr(X)
X = X + Y            Phase II
writetr(X)
unlock(X)

Example: two-phase locking.

Two-phase locking, though it provides serializability, has a disadvantage. Since the locks are not released immediately after the use of an item is over, but are retained till all the other needed locks are also acquired, the desired amount of interleaving may not be achieved. Worse, while a transaction T may be holding an item X, though it is not using it, just to satisfy the two-phase locking protocol, another transaction T1 may genuinely need the item but will be unable to get it till T releases it. This is the price that is to be paid for the guaranteed serializability provided by the two-phase locking system.

6.2 Deadlock and Starvation:

A deadlock is a situation wherein each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T' in the set. Taking the case of only two transactions T1 and T2: T1 is waiting for an item X which is held by T2, and is itself holding another item Y. T1 will release Y only when X becomes available from T2 and T1 can complete some operations. Meanwhile, T2 is waiting for Y held by T1, and T2 will release X only when Y, held by T1, is released and T2 has performed some operations on it. It can easily be seen that this is an infinite wait, and the deadlock will never get resolved.

T1                      T2
readlock(Y)
readtr(Y)
                        readlock(X)
                        readtr(X)
writelock(X)
                        writelock(Y)

A partial schedule leading to deadlock. (The status graph: T1 waits for T2, and T2 waits for T1.)

In the case of only two transactions, it is rather easy to notice the possibility of deadlock, though preventing it may be difficult. The situation becomes more complicated when more than two transactions are in a deadlock, and even identifying the deadlock may be difficult.

6.2.1 Deadlock prevention protocols

The simplest way of preventing deadlock is to look at the problem in detail. Deadlock occurs basically because a transaction has locked several items but could not get one more item, and is not releasing the other items held by it. The solution is to develop a protocol wherein a transaction must acquire locks on all the items it needs in advance; if any one or more of the items cannot be obtained, it does not hold the other items either, so that those items remain available to any other transaction that may need them. This method, though it prevents deadlocks, further limits the prospects of concurrency.

A better way to deal with deadlocks is to identify a deadlock when it occurs and then take some decision. The transaction involved in the deadlock may be blocked or aborted, or the transaction can preempt and abort the other transaction involved. In a typical case, the concept of a transaction time stamp TS(T) is used, based on when the transaction was started (the larger the value of TS, the younger the transaction). Two methods of deadlock recovery are devised:

1. Wait-die method: Suppose a transaction Ti tries to lock an item X, but is unable to do so because X is locked by Tj with a conflicting lock. Then, if TS(Ti) < TS(Tj) (Ti is older than Tj), Ti is allowed to wait. Otherwise (if Ti is younger than Tj), Ti is aborted and restarted later with the same time stamp. The policy is that the older of the transactions will already have spent sufficient effort and hence should not be aborted.

2. Wound-wait method: If TS(Ti) < TS(Tj) (Ti is older than Tj), abort Tj and restart it later with the same time stamp. On the other hand, if Ti is younger, then Ti is allowed to wait.

It may be noted that in both cases, the younger transaction gets aborted, but the actual method of aborting is different. Both these methods can be proved to be deadlock free, because no cycles of waiting, as seen earlier, are possible with these arrangements.

There is another class of protocols that do not require any time stamps. They include the "no waiting" algorithm and the "cautious waiting" algorithm. In the no-waiting algorithm, if a transaction cannot get a lock, it is aborted immediately (no waiting) and restarted at a later time. But since there is no guarantee that the new situation is deadlock free, it may have to be aborted again. This may lead to a situation where a transaction ends up getting aborted repeatedly.


To overcome this problem, the cautious waiting algorithm was proposed. Suppose transaction Ti tries to lock an item X, but cannot get X since X is already locked by another transaction Tj. The solution is as follows: if Tj is not blocked (not waiting for some other locked item), then Ti is blocked and allowed to wait; otherwise Ti is aborted. This method not only reduces repeated aborting, but can also be proved to be deadlock free, since of Ti and Tj, only one is blocked, after ensuring that the other is not blocked.

6.2.2 Deadlock detection & timeouts:

The second method of dealing with deadlocks is to detect deadlocks as and when they happen. The basic problem with the earlier suggested protocols is that they assume we know what is happening in the system: which transaction is waiting for which item, and so on. But in a typical case of concurrent operations, the situation is fairly complex, and it may not be possible to predict the behavior of transactions.

In such cases, the easier method is to take on deadlocks as and when they happen and try to solve them. A simple way to detect a deadlock is to maintain a "wait-for" graph. One node in the graph is created for each executing transaction. Whenever a transaction Ti is waiting to lock an item X which is currently held by Tj, an edge (Ti -> Tj) is created in the graph. When Tj releases X, this edge is dropped. It is easy to see that whenever there is a deadlock situation, loops will be formed in the wait-for graph, so that suitable corrective action can be taken. Once a deadlock has been detected, the transaction to be aborted has to be chosen; this is called "victim selection", and generally newer transactions are selected for victimization.

Another easy method of dealing with deadlocks is the use of timeouts. Whenever a transaction is made to wait for a period longer than a predefined period, the system assumes that a deadlock has occurred and aborts the transaction. This method is simple and has low overheads, but may end up removing a transaction even when there is no deadlock.

6.3 Starvation:

The other side effect of locking is starvation, which happens when a transaction cannot proceed for indefinitely long periods, though the other transactions in the system continue normally. This may happen if the waiting scheme for locked items is unfair, i.e. if some transactions may never be able to get the items, since one or another of the high priority transactions may continuously be using them. The low priority transaction will then be forced to "starve" for want of resources.

The solution to starvation problems lies in choosing proper priority algorithms, like first-come-first-served. If this is not possible, then the priority of a transaction may be increased every time it is made to wait or is aborted, so that eventually it becomes a high priority transaction and gets the required service.
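This "aging" idea can be sketched in a few lines. The Txn class and the scheduling function below are illustrative assumptions of the sketch, not a real scheduler.

from dataclasses import dataclass

@dataclass
class Txn:
    name: str
    priority: int = 0

def on_wait_or_abort(t: Txn) -> None:
    # Each time a transaction is made to wait or is aborted,
    # its priority is raised (the transaction "ages").
    t.priority += 1

def next_to_schedule(waiting: list) -> Txn:
    # The transaction that has waited the longest runs first.
    return max(waiting, key=lambda t: t.priority)

t1, t2 = Txn("T1"), Txn("T2")
for _ in range(3):
    on_wait_or_abort(t1)                   # T1 has been made to wait three times
print(next_to_schedule([t1, t2]).name)     # T1: the starved transaction goes first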

6.4 Concurrency control based on Time Stamp ordering

6.4.1 The concept of time stamps: A time stamp is a unique identifier created by the DBMS and attached to each transaction, which indicates a value that is a measure of when the transaction came into the system. Roughly, a time stamp can be thought of as the starting time of the transaction, denoted by TS(T).

Time stamps can be generated by a counter that is initially zero and is incremented each time its value is assigned to a transaction. The counter is also given a maximum value, and if the reading goes beyond that value, the counter is reset to zero, indicating, most often, that the transaction has lived its life span inside the system and needs to be taken out. A better way of creating such time stamps is to make use of the system time/date facility or even the internal clock of the system.

6.4.2 An algorithm for ordering the time stamp: The basic concept is to order the transactions based on their time stamps. A schedule made of such transactions is then serializable. This concept is called time stamp ordering (TO). The algorithm should ensure that whenever a data item is accessed by conflicting operations in the schedule, the data is available to them in the serializability order. To achieve this, the algorithm uses two time stamp values for each item:

1. Read_TS(X): This indicates the largest time stamp among the transactions that have successfully read the item X. Note that the largest time stamp actually refers to the youngest of the transactions in the set (that has read X).

2. Write_TS(X): This indicates the largest time stamp among all the transactions that have successfully written the item X. Note that the largest time stamp actually refers to the youngest transaction that has written X.


The above two values are often referred to as “read time stamp” and “write time stamp” of the

item X.

6.4.3 The concept of basic time stamp ordering: Whenever a transaction T tries to read or write an item X, the algorithm compares the time stamp of T with the read time stamp or the write time stamp of the item X, as the case may be. This is done to ensure that T does not violate the order of time stamps. A violation can arise in the following ways:

1. Transaction T is trying to write X:

a) If read_TS(X) > TS(T) or write_TS(X) > TS(T), then abort and roll back T and reject the operation. In plain words, if a transaction younger than T has already read or written X, the time stamp ordering is violated, and hence T is to be aborted and all the values written by T so far need to be rolled back, which may also involve cascaded rollbacks.

b) Otherwise (read_TS(X) <= TS(T) and write_TS(X) <= TS(T)), execute the write_tr(X) operation and set write_TS(X) to TS(T), i.e. allow the operation and set the write time stamp of X to that of T, since T is the latest transaction to have accessed X.

2. Transaction T is trying to read X:

a) If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This is because a younger transaction has already written into X.

b) If write_TS(X) <= TS(T), execute read_tr(X) and set read_TS(X) to the larger of the two values, namely TS(T) and the current read_TS(X).

This algorithm ensures proper ordering and also avoids deadlocks by penalizing the older transaction when it tries to override an operation already done by a younger transaction. Of course, the aborted transaction will be reintroduced later with a "new" time stamp. However, in the absence of any other monitoring protocol, the algorithm may cause starvation of some transactions.
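These two rules can be captured in a short sketch. The dictionaries, the AbortTransaction exception and the function names below are assumptions of this illustration, not a real DBMS interface.

read_ts = {}    # item -> read time stamp
write_ts = {}   # item -> write time stamp

class AbortTransaction(Exception):
    pass

def write_item(t_ts: int, x: str, db: dict, value) -> None:
    # Rule 1: abort T if a younger transaction has already read or written X.
    if read_ts.get(x, 0) > t_ts or write_ts.get(x, 0) > t_ts:
        raise AbortTransaction(f"write on {x} violates time stamp order")
    db[x] = value
    write_ts[x] = t_ts                         # T is now the latest writer of X

def read_item(t_ts: int, x: str, db: dict):
    # Rule 2: abort T if a younger transaction has already written X.
    if write_ts.get(x, 0) > t_ts:
        raise AbortTransaction(f"read on {x} violates time stamp order")
    read_ts[x] = max(read_ts.get(x, 0), t_ts)  # remember the youngest reader
    return db.get(x)

db = {"X": 10}
print(read_item(5, "X", db))    # T with TS = 5 reads X
write_item(7, "X", db, 20)      # T with TS = 7 writes X: allowed
# write_item(6, "X", db, 30) would raise AbortTransaction,
# since write_ts["X"] is now 7 > 6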

6.4.4 Strict time Stamp Ordering:

This variation of the time stamp ordering algorithm ensures that the schedules are "strict" (so that recoverability is enhanced) and serializable. In this case, any transaction T that tries to read or write X such that write_TS(X) < TS(T) is made to wait until the transaction T' that originally wrote into X (hence whose time stamp matches the write time stamp of X, i.e. TS(T') = write_TS(X)) is committed or aborted. This algorithm also does not cause any deadlock, since T waits for T' only if TS(T) > TS(T').

6.5 Multi version concurrency control techniques

The main reason why some of the transactions have to be aborted is that they try to access data items that have been updated by transactions younger than them. One way of overcoming this problem is to maintain several versions of the data items, so that if a transaction tries to access an updated data item, instead of aborting it, it may be allowed to work on an older version of the data. This concept is called the multiversion method of concurrency control.

Whenever a transaction writes a data item, the new value of the item is made available, as is the older version. Normally the transactions are given access to the newer version, but in case of conflicts the policy is to allow the "older" transaction to have access to the "older" version of the item.

The obvious drawback of this technique is that more storage is required to maintain the

different versions. But in many cases, this may not be a major drawback, since most database

applications continue to retain the older versions anyway, for the purposes of recovery or for

historical purposes.

6.5.1 Multiversion Technique based on timestamp ordering

In this method, several versions of the data item X, which we call X1, X2, ..., Xk, are maintained. For each version Xi, two timestamps are appended:

i) Read_TS(Xi): the read timestamp of Xi indicates the largest of all the time stamps of transactions that have read Xi (this, in plain language, means the youngest of the transactions which has read it).

ii) Write_TS(Xi): the write timestamp of Xi indicates the time stamp of the transaction that wrote Xi.

Whenever a transaction T writes into X, a new version Xk+1 is created, with both write_TS(Xk+1) and read_TS(Xk+1) being set to TS(T). Whenever a transaction T reads X, the value of read_TS(Xi) is set to the larger of the two values, namely read_TS(Xi) and TS(T).

To ensure serializability, the following rules are adopted.


i) If T issues a write_tr(X) operation, and Xi is the version with the highest write_TS(Xi) that is less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back T; else create a new version of X, say Xk, with read_TS(Xk) = write_TS(Xk) = TS(T).

In plain words, find the version of X whose write timestamp is the highest that does not exceed TS(T); if that version has been read by a transaction younger than T, we have no option but to abort T and roll back all its effects; otherwise a new version of X is created with its read and write timestamps initialized to that of T.

ii) If a transaction T issues a read_tr(X) operation, find the version Xi with the highest write_TS(Xi) that is less than or equal to TS(T), return the value of Xi to T, and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).

This only means: find the highest version of X that T is eligible to read, and return its value to T. Since T has now read the value, find out whether it is the youngest transaction to read X by comparing its timestamp with the current read timestamp of Xi. If T is younger (its timestamp is higher), store its timestamp as that of the youngest transaction to visit Xi; else retain the earlier value.
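A hedged sketch of these two rules follows. The Version record and the function names are illustrative assumptions, and the initial version with both timestamps zero stands in for the item's starting state.

from dataclasses import dataclass

@dataclass
class Version:
    value: object
    read_ts: int
    write_ts: int

class AbortTransaction(Exception):
    pass

# One initial version of item X, with both timestamps zero.
versions = [Version(value=0, read_ts=0, write_ts=0)]

def mv_write(t_ts: int, value) -> None:
    # Rule i): find the version with the highest write_TS <= TS(T).
    v = max((u for u in versions if u.write_ts <= t_ts),
            key=lambda u: u.write_ts)
    if v.read_ts > t_ts:
        # A transaction younger than T has already read that version.
        raise AbortTransaction("write violates multiversion order")
    versions.append(Version(value=value, read_ts=t_ts, write_ts=t_ts))

def mv_read(t_ts: int):
    # Rule ii): read the version with the highest write_TS <= TS(T).
    v = max((u for u in versions if u.write_ts <= t_ts),
            key=lambda u: u.write_ts)
    v.read_ts = max(v.read_ts, t_ts)    # remember the youngest reader
    return v.value

print(mv_read(5))    # reads the initial version; its read_TS becomes 5
mv_write(7, 42)      # creates a new version with read_TS = write_TS = 7
print(mv_read(6))    # TS = 6 still reads the old version (write_TS 0 <= 6)

Notice that the read with TS = 6 succeeds even though a younger transaction has already written X; this is precisely the abort that multiversioning avoids.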

6.5.2 Multiversion two-phase locking using certify locks:

The motivation behind two-phase locking systems has been discussed previously. In the standard locking mechanism, a write lock is an exclusive lock, i.e. only one transaction can use a write-locked data item. However, no harm is done if an item write-locked by one transaction is read by one or more other transactions. On the contrary, it enhances the "interleavability" of operations, that is, more transactions can be interleaved. This concept is extended to the multiversion locking system by using what are known as "multiple-mode" locking schemes. In this scheme, there are three locking modes for an item: read, write and certify, i.e. an item can be locked for read(X), write(X) or certify(X), or it can remain unlocked. To see how the scheme works, we first see how the normal read/write system works by means of a lock compatibility table.

Lock Compatibility Table

              Read    Write
Read          Yes     No
Write         No      No

The explanation is as follows: if there is an entry "Yes" in a particular cell, then if a transaction T holds the type of lock specified in the column header and another transaction T' requests the type of lock specified in the row header, T' can obtain the lock, because the lock modes are compatible. For example, there is a "Yes" in the first cell; its column header is Read. So if a transaction T holds a read lock and another transaction T' requests a read lock, the request can be granted. On the other hand, if T holds a write lock and another T' requests a read lock, it will not be granted, because the action has now shifted to the first row, second column element, which is "No". In the modified (multimode) locking system, the concept is extended by adding one more row and column to the table:

              Read    Write   Certify
Read          Yes     Yes     No
Write         Yes     No      No
Certify       No      No      No

The multimode locking system works on the following lines. When one of the transactions has obtained a write lock on a data item, the other transactions may still be granted read locks on the item. To ensure this, two versions of X are maintained. X(old) is a version which has been written and committed by a previous transaction. When a transaction T wants a write lock, a new version X(new) is created and handed over to T for writing. While T continues to hold the lock on X(new), other transactions can continue to use X(old) under read locks.

Once T is ready to commit, it should get exclusive "certify" locks on all the items it wants to commit by writing. Note that a write lock is no longer an exclusive lock under this new scheme of things, since while one transaction is holding a write lock on X, one or more other transactions may be holding read locks on the same X. To provide the certify lock, the system waits till all other read locks on the item are cleared. Note that this process has to be repeated for all items that T wants to commit.


Once all these items are under the certify locks of the transaction, it can commit its values. From then on, X(new) becomes X(old), and a new X(new) will be created only when another transaction wants a write lock on X. This scheme avoids cascading rollbacks. But since a transaction has to get exclusive certify locks on all items before it can commit, a delay in the commit operation is inevitable. This may also lead to complexities like deadlocks and starvation.
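The extended table can be expressed directly as a compatibility matrix. The sketch below is illustrative only; it checks a requested lock mode against every mode currently held on the item.

# COMPAT[held][requested] says whether the requested mode can be granted
# while a lock of the "held" mode exists on the item.
COMPAT = {
    "read":    {"read": True,  "write": True,  "certify": False},
    "write":   {"read": True,  "write": False, "certify": False},
    "certify": {"read": False, "write": False, "certify": False},
}

def can_grant(held_modes: set, requested: str) -> bool:
    # A request is granted only if it is compatible with every lock held.
    return all(COMPAT[h][requested] for h in held_modes)

print(can_grant({"write"}, "read"))             # True: readers may use X(old)
print(can_grant({"read", "write"}, "certify"))  # False: wait for read locks to clear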


Chapter: 6

TRANSACTION MANAGEMENT & CONCURRENCY CONTROL TECHNIQUE

End Chapter quizzes

Q1. The sequence of operations on the database is called

a) Schedule

b) Database Recovery

c) Locking

d) View

Q2. Two operations in a schedule are said to be in conflict if they satisfy the conditions

a) The operations belong to different transactions

b) They access the same item x

c) At least one of the operations is a write operation.

d) All of the above.

Q3. If, for every transaction T in the schedule S, all operations of T are executed consecutively, then schedule S is called

a) Serial schedule

b) Non serial schedule

c) Time stamping

d) None of the above

Q4. Concurrency control is needed to manage

a) Transactions from a large number of users

b) Maintain consistency of database

c) Both a and b

d) None of the above

Q5. A time stamp is a unique identifier created by the DBMS, attached to each

a) Data Item

b) Transaction

c) Schedule

d) All of the above


Q6. A read lock is also called

a) Shared Lock

b) Binary Lock

c) Write Lock

d) Dead Lock

Q7. A write lock is also called

a) Two Phase Lock

b) Exclusive Lock

c) Binary Lock

d) None of the above

Q8. The ability to recover from transaction failures is called

a) Recoverability

b) Back up

c) Database Detection

d) Both a and b

Q9. A lock that can have only two states or values (1 or 0) is known as a

a) Binary Lock

b) 2 Phase Lock

c) Both a and b

d) Read Lock

Q10. The property of a transaction which ensures that the transaction is either fully completed or not begun at all is

a) Consistency

b) Atomic

c) Durability

d) Isolation


Chapter: 7

DATABASE RECOVERY, BACKUP & SECURITY

1. Introductory Concept of Database Failures and Recovery

Database operations cannot be insulated from failures of the system on which they operate (both the hardware and the software, including the operating system). The system should ensure that any transaction submitted to it is terminated in one of the following ways:

a) All the operations listed in the transaction are completed, the changes are recorded permanently back to the database, and it is indicated that the operations are complete.

b) In case the transaction has failed to achieve its desired objective, the system should ensure that no change whatsoever is reflected onto the database. Any intermediate changes made to the database are restored to their original values before calling off the transaction and intimating the same to the user.

In the second case, we say the system should be able to “Recover” from the failure.

1.1 Database failure

Database Failures can occur in a variety of ways.

i) A system crash: A hardware, software or network error can make the completion of the transaction impossible.

ii) A transaction or system error: The transaction submitted may be faulty, like creating a situation of division by zero or producing a negative number which cannot be handled (for example, in a reservation system, a negative number of seats conveys no meaning). In such cases, the system simply discontinues the transaction and reports an error.

iii) Some programs allow the user to interrupt during execution. If the user changes his mind during execution (but before the transaction is complete), he may opt out of the operation.


iv) Local exceptions: Certain conditions during operation may force the system to raise what are known as "exceptions". For example, a bank account holder may not have sufficient balance for some transaction to be done, or special instructions might have been given in a bank transaction that prevent further continuation of the process. In all such cases, the transactions are terminated.

v) Concurrency control enforcement: In certain cases, when concurrency constraints are violated, the enforcement regime simply aborts the process, to be restarted later.

The other reasons can be physical problems like theft, fire etc., or system problems like disk failure, viruses etc. In all such cases of failure, a recovery mechanism has to be in place.

1.2 Database Recovery

Recovery most often means bringing the database back to the most recent consistent state in the case of transaction failures. This obviously demands that status information about the previous consistent states is made available in the form of a "log" (which has been discussed in one of the previous sections in some detail).

A typical algorithm for recovery should proceed on the following lines.

1. If the database has been physically damaged or there are catastrophic crashes like a disk crash etc., the database has to be recovered from the archives. In many cases, a reconstruction process is to be adopted using various other sources of information.

2. In situations where the database is not damaged but has lost consistency because of transaction failures etc., the method is to retrace the steps from the state of the crash (which has created the inconsistency) until the previously encountered state of consistency is reached. The method normally involves undoing certain operations, restoring previous values using the log, and so on.

In general, two broad categories of these retracing operations can be identified. As we have seen previously, most often the transactions do not update the database as and when they complete their operations. So, if a transaction fails or the system crashes before the commit operation, those values need not be retraced, and no "undo" operation is needed. However, if one is still interested in getting the results out of the transactions, then a "redo" operation will have to be taken up. Hence, this type of retracing is often called the "NO-UNDO/REDO algorithm". The whole concept works only when the system is operating in a "deferred update" mode.

However, this may not always be the case. In certain situations, where the system is working in the "immediate update" mode, the transactions keep updating the database without waiting for the commit operation, and the updates will normally be made on the disk as well. Hence, if the system fails while the immediate updates are being made, it becomes necessary to undo the operations using the disk entries. This will help us reach the previous consistent state; from there onwards, the transactions will have to be redone. Hence, this method of recovery is often termed the UNDO/REDO algorithm.

2. Role of checkpoints in recovery:

A "checkpoint", as the name suggests, indicates that everything is fine up to that point. In a log, when a checkpoint is encountered, it indicates that all values up to it have been written back to the DBMS on the disk. Any further crash or system failure will have to take care of the data appearing beyond this point only. Put the other way, all transactions that have their commit entries in the log before this point need no rolling back.

The recovery manager of the DBMS will decide at what intervals checkpoints need to be inserted (in turn, at what intervals data is to be written back to the disk). It can be either after specific periods of time (say, m minutes) or after a specific number of transactions (t transactions), etc. When the protocol decides to checkpoint, it does the following (a short sketch follows the list):

a) Suspend all transaction executions temporarily.

b) Force write all memory buffers to the disk.

c) Insert a check point in the log and force write the log to the disk.

d) Resume the execution of transactions.
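These four steps can be sketched as follows; the Scheduler, Buffers, Page and Log classes are simplified stand-ins for the corresponding DBMS components, not a real interface.

class Scheduler:
    def suspend_all(self): print("transactions suspended")
    def resume_all(self): print("transactions resumed")

class Page:
    def flush_to_disk(self): print("dirty page flushed")

class Buffers:
    def dirty_pages(self): return [Page(), Page()]

class Log:
    def __init__(self): self.records = []
    def append(self, rec): self.records.append(rec)
    def flush_to_disk(self): print("log flushed:", self.records)

def take_checkpoint(scheduler, buffers, log):
    scheduler.suspend_all()             # a) suspend transaction execution
    for page in buffers.dirty_pages():  # b) force-write all modified buffers
        page.flush_to_disk()
    log.append("CHECKPOINT")            # c) insert a checkpoint record in the log
    log.flush_to_disk()                 #    and force-write the log to the disk
    scheduler.resume_all()              # d) resume transaction execution

take_checkpoint(Scheduler(), Buffers(), Log())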

The force writing need not only refer to the modified data items, but can include the various lists

and other auxiliary information indicated previously.

However, the force writing of all the data pages may take some time, and it would be wasteful to halt all transactions until then. A better way is to make use of "fuzzy checkpointing", wherein the checkpoint is inserted and, while the buffers are being written back (beginning from the previous checkpoint), the transactions are allowed to restart. This way the I/O time is saved. Until all data up to the new checkpoint is written back, the previous checkpoint is held valid for recovery purposes.

3. Write-ahead logging:

When updating is being used, it is necessary to maintain a log for recovery purposes. Normally, before the updated value is written on to the disk, the earlier value (called the Before Image value, BFIM) is noted down elsewhere on the disk for recovery purposes. This process of recording entries is called "write-ahead logging" (writing the log ahead of the update). It is to be noted that the type of logging also depends on the type of recovery. If the NO-UNDO/REDO type of recovery is being used, then only those values which could not be written back before the crash need to be logged. But in the UNDO/REDO type, the before-image values, as well as those that were computed but could not be written back, need to be logged.
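A minimal sketch of the write-ahead rule follows, assuming an in-memory dictionary as the database and a list as the log; flush_log is a stand-in for forcing the log records to stable storage.

log = []   # each record: (operation, transaction, item, BFIM, new value)

def flush_log() -> None:
    # Stand-in for forcing the log records to stable storage.
    pass

def update_item(db: dict, tid: str, x: str, new_value) -> None:
    old_value = db.get(x)                # the before image, BFIM
    log.append(("write", tid, x, old_value, new_value))
    flush_log()                          # the log reaches the disk first ...
    db[x] = new_value                    # ... and only then the data itself

db = {"X": 10}
update_item(db, "T1", "X", 99)
print(log)   # [('write', 'T1', 'X', 10, 99)]: enough to undo or redo later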

Two other update mechanisms need brief mention. In one approach, the cache pages updated by a transaction cannot be written back to the disk by the DBMS manager until and unless the transaction commits. If the system strictly follows this approach, then it is called a "no-steal" approach. However, in some cases, the protocol allows the writing of the updated buffer back to the disk even before the transaction commits. This may be done, for example, when some other transaction is in need of the results. This is called the "steal" approach.

Secondly, if all pages are updated as soon as the transaction commits, then it is a "force" approach; otherwise it is called a "no-force" approach. Most protocols make use of steal/no-force strategies, so that there is no urgency of writing back to the disk once the transaction commits.

However, just the before image (BFIM) and after image (AFIM) values may not be sufficient for successful recovery. A number of lists, including the list of active transactions (those that have started operating but have not committed yet), committed transactions and aborted transactions, need to be maintained to avoid a brute force method of recovery.


4. Recovery techniques based on Deferred Update:

This is a very simple method of recovery. Theoretically, no transaction can write back into the database until it has committed; till then, it can only write into a buffer. Thus, in case of any crash, the buffer needs to be reconstructed, but the DBMS itself need not be recovered. However, in practice, many transactions are very long and it is dangerous to hold all their updates in the buffer, since the buffers can run out of space and may need a page replacement. To avoid such situations, wherein a page might be removed inadvertently, a simple two-pronged protocol is used:

1. A transaction cannot change the DBMS values on the disk until it commits.

2. A transaction does not reach its commit stage until all its update values are written on to the log, and the log itself is force-written on to the disk.

Notice that in case of failures, recovery is by the NO-UNDO/REDO technique, since all the data will be in the log if a transaction fails after committing.

4.1 An algorithm for recovery using deferred update in a single user environment

In a single user environment, the algorithm is a straight application of the REDO procedure. It uses two lists of transactions: the transactions committed since the last checkpoint, and the transactions active when the crash occurred. Apply REDO to all the write_tr operations of the committed transactions from the log, and let the active transactions run again. The assumption is that the REDO operations are "idempotent", i.e. the operations produce the same results irrespective of the number of times they are redone, provided they start from the same initial state. This is essential to ensure that the recovery operation does not produce a result that is different from the case where no crash occurred at all. (Though this may look like a trivial constraint, students may verify for themselves that not all DBMS applications satisfy this condition.)


Also, since there was only one transaction active (because it was a single user system) and it had not updated the database yet, all that remains to be done is to restart this transaction.

4.2 Deferred update with Concurrent execution:

Most DBMS applications, we have insisted repeatedly, are multi-user in nature, and the best way to run them is by concurrent execution. Hence, protocols for recovery from a crash in such cases are of prime importance.

To simplify matters, we presume that we are talking of strict and serializable schedules, i.e. there is strict two-phase locking and the locks remain effective till the transactions commit. In such a scenario, an algorithm for recovery could be as follows:

Use two lists: the list of committed transactions T since the last checkpoint, and the list of active transactions T'. REDO all the write operations of the committed transactions in the order in which they were written into the log. The active transactions are simply cancelled and resubmitted.

Note that once we put the strict serializability conditions, the recovery process does not

vary too much from the single user system.

Note that in the actual process, a given item X may be updated a number of times, either by the same transaction or by different transactions at different times. What is important to the user is its final value. However, the above algorithm simply updates the value whenever an update of it appears in the log. This can be made more efficient in the following manner: instead of starting from the checkpoint and proceeding towards the time of the crash, traverse the log from the time of the crash backwards. Whenever a value is updated, for the first time encountered in this traversal, update it and record the fact that its value has been updated; any further (older) updates of the same item can be ignored.
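The backward traversal can be sketched as follows. Log records are simplified here to (transaction, item, new value) tuples, which is an assumption of this sketch rather than a real log format.

def redo_backwards(log_records, committed: set, db: dict) -> None:
    redone = set()
    # Traverse the log from the crash backwards.
    for tid, item, value in reversed(log_records):
        if tid in committed and item not in redone:
            db[item] = value     # the first occurrence seen is the latest write
            redone.add(item)     # older writes to the same item are ignored

log_records = [("T1", "X", 5), ("T2", "X", 8), ("T1", "Y", 3)]
db = {}
redo_backwards(log_records, committed={"T1", "T2"}, db=db)
print(db)   # {'Y': 3, 'X': 8}: X receives only its latest committed value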


This method, though it guarantees correct recovery, has some drawbacks. Since the items remain locked with the transactions until the transactions commit, the efficiency of concurrent execution comes down. Also, a lot of buffer space is wasted to hold the values till the transactions commit, and the number of such values can be large; when long transactions work in concurrent mode, they delay the commit operations of one another.

5. Recovery techniques based on immediate update

In these techniques, whenever a write_tr(X) is issued, the data is written on to the database without waiting for the commit operation of the transaction. However, as a rule, the update operation is accompanied by writing on to the log (on the disk), using a write-ahead logging protocol.

This helps in undoing the update operations whenever a transaction fails; the rolling back can be done using the data in the log. Further, if the transaction is made to commit only after writing all its values, there is no need for a redo of these operations after the transaction has failed. This concept is called the UNDO/NO-REDO recovery algorithm. On the other hand, if some transaction commits before writing all its values, then a general UNDO/REDO type of recovery algorithm is necessary.

5.1 A typical UNDO/REDO algorithm for an immediate update single user environment

Here, at the time of failure, the changes envisaged by the transaction may have already been recorded in the database. These must be undone. A typical procedure for recovery should proceed on the following lines (a sketch follows the steps):

a) The system maintains two lists: the list of committed transactions since the last checkpoint and the list of active transactions (only one active transaction, in fact, because it is a single user system).

b) In case of failure, undo all the write_tr operations of the active transaction, by using the information in the log, via the UNDO procedure.


c) For undoing a write_tr(X) operation, examine the corresponding log entry write_tr(T, X, old_value, new_value) and set the value of X to old_value. The sequence of undoing must be in the reverse of the order in which the operations were written on to the log.

d) REDO the write_tr operations of the committed transactions from the log, in the order in which they were written in the log, using the REDO procedure.
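A simplified sketch of this UNDO/REDO procedure follows, with log entries modelled as the (T, X, old_value, new_value) tuples described in step c) above; the function and variable names are illustrative.

def recover(log_records, committed: set, active: set, db: dict) -> None:
    # UNDO: restore the old values of uncommitted writes, newest first.
    for tid, x, old, new in reversed(log_records):
        if tid in active:
            db[x] = old
    # REDO: reapply the committed writes in the order they were logged.
    for tid, x, old, new in log_records:
        if tid in committed:
            db[x] = new

log_records = [("T1", "X", 10, 20), ("T2", "Y", 1, 2), ("T1", "Z", 5, 6)]
db = {"X": 20, "Y": 2, "Z": 6}    # the state on disk at crash time
recover(log_records, committed={"T2"}, active={"T1"}, db=db)
print(db)   # {'X': 10, 'Y': 2, 'Z': 5}: T1 undone, T2's write redone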

5.2 The UNDO/REDO recovery based on immediate update with concurrent execution

In the concurrent execution scenario, the process becomes slightly more complex. In the following algorithm, we presume that the log includes checkpoints and that the concurrency protocol uses strict schedules, i.e. the schedule does not allow a transaction to read or write an item until the transaction that wrote the item previously has committed. Hence, the danger of cascading transaction failures is minimal. However, deadlocks can force aborts and UNDO operations. The simplified procedure is as follows:

a) Use the two lists maintained by the system: the committed transactions list (since the last checkpoint) and the list of active transactions.

b) Undo all the write_tr(X) operations of the active transactions, which have not yet committed, using the UNDO procedure. The undoing must be in the reverse of the order in which the operations were written into the log.

c) Redo all the write_tr(X) operations of the committed transactions from the log, in the order in which they were written into the log.

Normally, the process of redoing the write_tr(X) operations begins at the end of the log and proceeds in the reverse order, so that when an item X has been written into more than once in the log, only the latest entry is applied, as discussed in a previous section.

6. Shadow paging

It is not always necessary that the original database is updated by overwriting the previous values. As discussed in an earlier section, we can keep multiple versions of the data items, creating a new version whenever an update is made. The concept of shadow paging illustrates this.


[Figure: Shadow paging. A shadow directory and a current directory each hold numbered entries (1, 2, 3, ...) pointing to database pages. Updated pages such as Page 2 (new), Page 5 (new) and Page 7 (new) are pointed to by the current directory, while the shadow directory continues to point to the original Page 2, Page 5 and Page 7.]

In a typical case, the database is divided into pages and only those pages that need updating are brought into the main memory (or cache, as the case may be). A shadow directory holds pointers to these pages. Whenever an update is done, a new block for the page is created (indicated by the suffix "(new)" in the figure) and the updated values are placed there. Note that the new pages are created in the order of the updates, not in the serial order of the pages. A current directory holds pointers to these new pages. For all practical purposes, these are the "valid" pages, and they are written back to the database at regular intervals.

Now, if any rollback is to be done, the only operation needed is to discard the current directory and treat the shadow directory as the valid directory.
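A toy sketch of the idea follows. The block names and dictionaries are illustrative assumptions; a real system works with disk blocks rather than Python dictionaries.

disk = {"b1": "page2-data", "b2": "page5-data", "b3": "page7-data"}
shadow = {2: "b1", 5: "b2", 7: "b3"}    # committed state, never modified
current = dict(shadow)                   # working copy of the directory

def update_page(page_no: int, new_data: str) -> None:
    new_block = "b" + str(len(disk) + 1)  # allocate a fresh block on disk
    disk[new_block] = new_data            # write the new version there
    current[page_no] = new_block          # only the current directory moves

def rollback() -> None:
    global current
    current = dict(shadow)                # the old blocks become valid again

update_page(5, "page5-new")
print(current[5], shadow[5])   # b4 b2: two versions of page 5 now coexist
rollback()
print(current[5])              # b2: the shadow copy is the database again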

One difficulty is that the new, updated pages are kept at unrelated locations, and hence the concept of a "contiguous" database is lost. More importantly, what happens when the "new" pages are discarded as a part of an UNDO strategy? These blocks form "garbage" in the system. (The same thing happens when a transaction commits: the new pages become valid pages, while the old pages become garbage.) A mechanism to systematically identify all these pages and reclaim them becomes essential.

7. Database security and authorization

It is common knowledge that databases should be held secure against damage, unauthorized access and unauthorized updates. A DBMS typically includes a "database security and authorization subsystem" that is responsible for the security of the database against unauthorized access and attacks. Traditionally, two types of security mechanisms are in use:


i) Discretionary security mechanisms: Here each user (or group of users) is granted privileges and authorities to access certain records, pages or files, and is denied access to others. The discretion normally lies with the database administrator (DBA).

ii) Mandatory security mechanisms: These are standard security mechanisms that are used to enforce multilevel security, by classifying the data into different levels and allowing the users (or groups of users) access to certain levels only, based on the security policies of the organization. Here the rules apply uniformly across the board and the discretionary powers are limited.

While all these discussions assume that a user is allowed access to the system, but not to all parts of the database, at another level, efforts should be made to prevent unauthorized access to the system by outsiders. This comes under the purview of the security systems.

Another type of security is enforced in "statistical database security". Often, large databases are used to provide statistical information about various aspects like, say, income levels, qualifications, health conditions etc. These are derived by collecting a large number of individual data. A person who is doing the statistical analysis may be allowed access to the "statistical data", which is aggregated data, but he should not be allowed access to the individual data, i.e. he may know, for example, the average income level of a region, but cannot verify the income level of a particular individual. This problem is more often encountered in government and quasi-government organizations and is studied under the concept of "statistical database security".

It may be noted that in all these cases, the role of the DBA becomes critical. He normally logs into the system under a DBA account or a superuser account, which provides full capabilities to manage the database, ordinarily not available to the other users. Under the superuser account, he can manage the following aspects of security:


i) Account creation: He can create new accounts and passwords for users or user groups.

ii) Privilege granting: He can pass on privileges, like the ability to access certain files or certain records, to the users.

iii) Privilege revocation: The DBA can revoke certain or all privileges granted to one or several users.

iv) Security level assignment: The security level of a particular user account can be assigned, so that, based on the policies, the users become eligible or not eligible for accessing certain levels of information.
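The first three capabilities can be modelled with a toy sketch. The structures below are illustrative assumptions of this sketch, not the actual mechanism of any DBMS.

privileges = {}   # user -> set of (object, right) pairs granted to that user

def grant(user: str, obj: str, right: str) -> None:
    privileges.setdefault(user, set()).add((obj, right))

def revoke(user: str, obj: str, right: str) -> None:
    privileges.get(user, set()).discard((obj, right))

def can_access(user: str, obj: str, right: str) -> bool:
    # Every access is checked against the granted privileges first.
    return (obj, right) in privileges.get(user, set())

grant("alice", "EMPLOYEE", "read")
print(can_access("alice", "EMPLOYEE", "read"))    # True: privilege granted
print(can_access("alice", "EMPLOYEE", "write"))   # False: never granted
revoke("alice", "EMPLOYEE", "read")
print(can_access("alice", "EMPLOYEE", "read"))    # False: privilege revoked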

Another aspect of having individual accounts is the concept of the "database audit". It is similar to the system log that is created and used for recovery purposes. If we include in the log entries details regarding the name and account number of the user who created or used the transactions that wrote the log details, we can have a record of the accesses and other usage made by each user. This concept becomes useful in follow-up actions, including legal examinations, especially in sensitive and high security installations.

Another concept is the creation of "views". While a database record may have a large number of fields, a particular user may be authorized to have information only about certain fields. In such cases, whenever he requests the data item, a "view" of the data item is created for him, which includes only those fields which he is authorized to access. He may not even know that there are many other fields in the records.

The concept of views becomes very important when large databases, which cater to the needs of various types of users, are being maintained. Every user can have and operate upon his own view of the database, without being bogged down by the details. It also makes security maintenance operations convenient.


Chapter: 7 DATABASE RECOVERY, BACKUP & SECURITY

End Chapter quizzes

Q1. Database failures can occur due to:

a) Transaction Failure

b) System crash

c) Both a and b

d) Data backup

Q2. The granting of a right or privilege that enables a subject to have legitimate access to a system or a system's objects is known as

a) Authentication

b) Authorization

c) Data Unlocking

d) Data Encryption

Q3. The process of periodically taking a copy of the database and log file on to offline storage media is known as

a) Back up

b) Data Recovery

c) Data Mining

d) Data Locking

Q4. The encoding of the data by a special algorithm that renders the data unreadable is known as

a) Data hiding

b) Encryption

c) Data Mining

d) Both a and c

Q5. Access rights to a database are controlled by the

(a) top management

(b) system designer

(c) system analyst

(d) database administrator

Q6. A firewall is a system that prevents unauthorized access to or from a

a) Locks

b) Private network

c) Email

d) Data Recovery


Q7. A digital certificate is an attachment to an electronic message used for

a) security purposes

b) Recovery purpose

c) Database Locking

d) Both a and c

Q8. Rollback and Commit affect

(a) Only DML statements

(b) Only DDL statements

(c) Both (a) and (b)

(d) All statements executed in SQL*PLUS

Q9. A large database used to provide statistical information is known as a:

a) Geographical Database

b) Statistical Database

c) Web Database

d) Time Database