
Page 1: Database Design and Implementation

Christian Reina, CISSP, CISA

2010

MET CS669 Version 1.0

Page 2: Database Design and Implementation

Table of Contents

Chapter 1: DB Systems ............................................................ 3
Chapter 2: Data Models ........................................................... 7
Chapter 3: Relational DB Model .................................................. 11
Chapter 4: Entity Relationship (ER) Modeling .................................... 14
Chapter 5: Normalization of DB tables ........................................... 18
Chapter 6: Advanced Data Modeling ............................................... 21
Chapter 8: Advanced SQL ......................................................... 26
Chapter 9: Database Design ...................................................... 27
Chapter 10: Database Design ..................................................... 32
Chapter 11: DB Performance Tuning & Query Optimization ......................... 37
Chapter 12: Distributed database management system .............................. 42
Chapter 13: Business Intelligence and Data Warehouses ........................... 47

Page 3: Database Design and Implementation

Chapter 1: DB Systems

(Section 1.1) Data vs. Info

Data: Raw facts; they constitute the building blocks of information.

Information: The result of processing raw data to reveal its meaning.

Knowledge: The body of info and facts about a specific subject. A key characteristic is that new knowledge can be derived from old knowledge.

Data Management: A discipline that focuses on the proper generation, storage, and retrieval of data.

(Section 1.2) Introducing the Database and the DBMS

Database: A shared, integrated computer structure that stores a collection of:

End-user data that is raw facts of interest to the end user.

Metadata or data about data, through which the end-user data are integrated and managed.

For example, the metadata component stores info such as the name of each data element, the type of values (numeric, dates, or text) stored in each data element, whether or not the data element can be left empty, and so on.

In this sense, a database is a collection of self-describing data.

A well-designed db facilitates data management and becomes a valuable info generator. A poorly

designed db is likely to lead to errors in processing data and to bad decisions.

The most popular way to classify db is by the use and timeliness of the data:

Production DB: Contains up-to-the-minute real world info

Data warehouse: stores data for making decisions

(Section 1.2.1) Role and advantages of the DBMS

Database Management System (DBMS): Collection of programs that manages the database structure and

controls access to the data stored in the database. Helps manage the cabinet's contents. Advantages are

improved data sharing, data security, and data integration, minimized data inconsistency, improved data

access, improved decision making, and increased end-user productivity.

Query: A specific request issued to the DBMS for data manipulation to read or update the data. A query is

a question.

Ad hoc query: A spur-of-the-moment question.

Query result set: The answer the DBMS sends back to the application.
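A minimal sketch of an ad hoc query in SQL (the CUSTOMER table and its columns are invented for illustration); the rows the DBMS returns form the query result set:

    -- Spur-of-the-moment question: which customers owe more than 1000?
    SELECT CUS_CODE, CUS_LNAME, CUS_BALANCE
    FROM CUSTOMER
    WHERE CUS_BALANCE > 1000
    ORDER BY CUS_BALANCE DESC;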

(Section 1.2.2) Types of databases

DBMS can support many different types of databases:

(Number of users)

Single-user database: supports only one user at a time.

Desktop Database: A single user database that runs on a personal computer.

Multi user database: supports multiple users at the same time.

Workgroup Database: A multi-user database that supports a relatively small number of users (usually fewer than 50) or a specific department within an org.

Enterprise Database: Database is used by the entire organization and supports many users across many

departments. Usually in the hundreds.

(Db site location)

Page 4: Database Design and Implementation

Centralized Database: It supports data located at a single site.

Distributed Database: It supports data distributed across several different sites.

(Db use)

Operational or Transactional or Production Database: A db that is designed primarily to support a company's day-to-day operations.

Data warehouse: Focuses primarily on storing data used to generate information required to make tactical or strategic decisions. Data warehouses derive most of their data from production dbs.

Unstructured data: Data that exist in their original raw state.

Structured data: The result of taking unstructured data and formatting or structuring such data to

facilitate storage, use, and the generation of info.

Semi-structured data: data that have already been processed to some extent or prearranged.

XML: Language used to represent and manipulate data elements in a textual format.

(Section 1.3) Why Database design is important

DB Design: Refers to the activities that focus on the design of the database structure that will be used to

store and manage end user data. A well designed database facilitates data management and generates

accurate and valuable info. A poorly designed db will likely become a breeding ground for redundant data

and data anomalies.

(Section 1.4) Historical roots: files and file systems

An understanding of the relatively simple characteristics of file systems makes the complexity of

database design easier to understand.

An awareness of the problems that plagued file systems can help you avoid those same pitfalls

with DBMS software.

If you intend to convert an obsolete file system to a database system, knowledge of the file

system's basic limitations will be useful.

Data Processing (DP) Specialists: Created the necessary computer file structures, often wrote the software

that managed the data within those structures, and designed the app programs that produced reports based

on the file data. As the number of files increased, more DP specialists would have to be hired to accommodate the growth, and the original DP specialist would become the DP manager.

(Section 1.5) Problems with files system and data management

Making changes in an existing structure can be difficult in a file system environment. A structure-change conversion program typically:

1) Reads a record from the original file

2) Transforms the original data to conform to the new structures storage requirements.

3) Writes the transformed data into the new file structure.

4) Repeats steps 1 to 3 for each record in the original file.

It requires extensive programming for pulling records, deleting, and updating.

It can not perform ad hoc queries.

System admin can be complex and difficult when records or files expand.

It is difficult to make changes to existing structures.

Security features are likely to be inadequate.

Each file typically required its own set of data management programs.

Many files would suffer from data redundancy leading to inconsistencies, anomalies, and lack of

data integrity.

A mature file-based data system might require hundreds of thousands of programs.

These limitations lead to problems of structural and data dependency.

(Section 1.5.1) Structural and data dependence

Page 5: Database Design and Implementation

Structural dependence: A file system exhibits this, which means that access to the file is dependent on its

structure. For example adding a customer DOB to the Customer file would require the 4 steps described in

section 1.5.

Structural independence: Exists when it is possible to make changes in the file structure without affecting

the app programs ability to access the data.

Data dependence: Exists when data access programs are subject to change when any of the file's data storage characteristics change (that is, changing the data type). Data dependence makes the file system cumbersome.

Data independence: Exists when it is possible to make changes in the data storage characteristics without

affecting the application programs ability to access the data.

Logical data format: How the human views the data.

Physical data format: How the computer must work with the data.

Any program that accesses a file system's file must tell the computer what to do and how to do it.

(Section 1.5.2) Field definitions and naming conventions

Be descriptive in the field names but be aware of DBMS character length restrictions: Example REN

should be CUS_RENEW_DATE

(Section 1.5.3) Data Redundancy

Islands of Info: They contain different versions of the same data. It's the storage of the same basic data in

different locations.

Redundant Data: Source of difficult-to-trace info errors. It's when the same data about the same entity is

kept in different locations. It can result in storage of different values for the same attribute of the same

entity. They are a result of a poorly designed db which can lead to poor decision making.

Data Redundancy: Exists when the duplicated data are stored unnecessarily at different places. They are a

result of a poorly designed db which can lead to poor decision making.

Data Integrity: Condition in which all of the data in the db are consistent with the real-world events and

conditions. In other words, data integrity means that:

Data are accurate—there are no data inconsistencies

Data are verifiable—the data will always yield consistent results

Data Anomaly: Develops when not all of the required changes in the redundant data are made successfully.

(Section 1.6) Db systems

DBMS provides numerous advantages over file system management by making it possible to eliminate

most of the file system's data inconsistency, data anomaly, data dependency, and structural dependency

problems.

(Section 1.6.1) The Db system environment

Database System: An organization of components that define and regulate the collection, storage, management, and use of data within a database environment. It's composed of 5 major parts: Hardware, Software, People, Procedures, and Data.

(Jobs in the db field)

DB Admin: Focused on individual db and DBMSs & strong technical skills in specific DBMSs.

Data Admin: Plans for db and technology, sets standards for data (Privacy & risk of loss), works

with computerized and non-computerized dbs.

Page 6: Database Design and Implementation

DB Modeler/Analyst/Designer/Programmer: Responsible for design & implementation of db

and the app systems that interface with a DBMS. Modeler’s primary responsibility is gathering

the data requirements and representing them in the data model. Designer may participate in the

modeling, and translates the model into an operational db, often with the assistance of system and

storage admins. App Analysts gather, doc and coordinate the app and user requirements.

Programmers write the software apps, based on the application and data requirements.

(Section 1.6.2) DBMS functions

Data Dictionary management: DBMS stores data elements & their relationships (metadata) in a

data dictionary. DBMS uses the Data dictionary to look up the required data component structures

and relationships, thus relieving you from having to code such complex relationships in each

program. Any changes made in a db structure are automatically recorded in the data dictionary.

Data storage management: DBMS provides storage not only for the data but also for the related

data entry forms or screen definition, report definition, data validation rules, procedural code,

structures to handle video and pic formats. It's also important for Performance Tuning, which

relates to the activities that make the db perform more efficiently in terms of storage and access

speed. (DBMS creates the complex structures required for data storage)

Data transformations & presentation: When the DBMS formats the physically retrieved data to

make it conform to the user's logical expectations. (DBMS transforms entered data to conform to

the data structures)

Security management: The DBMS creates a security system that enforces user security and data

privacy. They determine which users can access the db, which data items each user can access,

and which data operations (read, add, delete, or modify) the user can perform. This is important

during multi-user mode. (DBMS creates a security system and enforces security within that

system)

Multi-user access control: To provide data integrity and data consistency, the DBMS uses

sophisticated algorithms to ensure that multiple users can access the db concurrently without

compromising the integrity of the db. (DBMS allows multiple users to have concurrent access to

the data)

Backup & Recovery: The DBMS provides backup and data recovery to ensure data safety and

integrity. Current DBMS systems provide special utilities that allow the DBA to perform routine

and special backup and restore procedures. (DBMS performs backup and data recovery procedures

to ensure data safety)

Data integrity management: The DBMS promotes and enforces integrity rules, thus minimizing

data redundancy and maximizing data consistency. The data relationships stored in the data

dictionary are used to enforce data integrity. Ensuring data integrity is especially important in

transaction-oriented db systems. (DBMS promotes and enforces integrity rules to eliminate data

integrity problems)

Db access languages & application programming interfaces: The DBMS provides data access

through a query language. A Query Language is a nonprocedural language that lets the user

specify what must be done without having to specify how it is to be done. Structured Query

Language (SQL) is the de facto query language and data access standard supported by the

majority of DBMS vendors. (DBMS provides access to the data via utility programs and

programming language interfaces)

Db comm Interfaces: DBMSs accept end-user requests via multiple, different network environments. For example, a DBMS might provide access to the database via the internet through the use of Firefox or IE. The DBMS can also automatically publish predefined reports on a web site.

DBMS can connect to third party systems to distribute info via email or other apps. (DBMS

provides access to data within a computer network environment)

(Section 1.6.3) Managing the db system: A shift in focus

Db systems' significant disadvantages:

Increased costs

Management complexity:

Maintaining currency:

Page 7: Database Design and Implementation

Vendor dependence:

Frequent upgrade/replacement cycles:

---------------------------------------------------------------------------------------------------------------------------------

Chapter 2: Data Models

(Section 2.1) Data modeling and data models

Data modeling: The first step in db design; it refers to the process of creating a specific data model for a determined problem domain. This is an iterative process.

Problem Domain: Is a clearly defined area within the real world environment.

Data Model: Collection of concepts that can be used to describe the structure of a db. Its main function is

to help us understand the complexities of the real world environment. It facilitates comm. between users, db

designers, and app programmers. There are 3 categories of data models:

High level or conceptual data models, which are based on entities (objects) and relationships.

Low level or physical data models, which are specific to a particular DBMS such as Oracle.

Representational or implementation data models, which are also termed logical data models.

(Section 2.2) The importance of data models

When a good db blueprint is not available, problems are likely to happen. Data models are like blueprints, and they are an abstraction.

(Section 2.3) Data model basic building blocks

The basic building blocks of all data models are entities, attributes, and relationships.

Entity: Is anything, such as a person, place, thing, idea, or event, about which data are to be

collected and stored. They can be physical, such as customers or products, or abstractions, such as

flight routes or accounts.

Attributes: are equivalent to fields for an entity such as a CUSTOMER. They can be Fname,

Lname, etc.

Relationship: describes an association among two or more entities. For example, "An AGENT can serve many CUSTOMERS, and each CUSTOMER may be served by one AGENT." Data

models use 3 types of relationship:

o One-to-many (1:M): painter paints many different paintings, but each one is painted by

only one painter.

o Many-to-many (M:N): a student can take many classes and each class can be taken by

many students.

o One-to-one (1:1): each store is managed by a single employee, and each employee manages only one store.

Entity-relationship model (ERM): helps identify the db's main entities and their relations. They are graphically represented so they are more easily understood by users and designers.

Entity-relationship Diagram (ERD): Chen model and Crows foot model.

Constraints: Restrictions placed on the data. They help to ensure data integrity.

(Section 2.4) Business rules

Business rule: is a brief, precise, and unambiguous description of a policy, procedure, or principle within a

specific org. They are used to define entities, attributes, relationships, and constraints.

(Section 2.4.1) Discovering business rules

The process of id and doc business rules is essential to db design for several reasons:

They help standardize the company's view of data

They can be a communication tool between users and designers.

They allow the designer to understand the nature, role, and scope of the data.

Page 8: Database Design and Implementation

They allow the designer to understand business processes.

They allow the designer to develop appropriate relationship participation rules and constraints

and to create an accurate data model.

(Section 2.4.2) Translating business rules into data model components

A noun in a business rule translates into an entity in the model, and a verb (active or passive) associating nouns will translate into a relationship among the entities. Example: "a customer may generate many invoices" contains two nouns (customer & invoices) and a verb "generate" that associates the nouns.

To id relationship type, you should ask two questions:

How many instances of B are related to one instance of A?

How many instances of A are related to one instance of B?
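A minimal sketch of how that rule could end up in a relational schema (table and column names are invented; standard SQL REFERENCES syntax assumed). CUSTOMER is the "1" side and INVOICE the "many" side, so the foreign key goes in INVOICE:

    CREATE TABLE CUSTOMER (
        CUS_CODE  INTEGER PRIMARY KEY,
        CUS_LNAME VARCHAR(30) NOT NULL
    );

    CREATE TABLE INVOICE (
        INV_NUMBER INTEGER PRIMARY KEY,
        INV_DATE   DATE NOT NULL,
        CUS_CODE   INTEGER NOT NULL REFERENCES CUSTOMER (CUS_CODE)  -- each invoice is generated by one customer
    );

Asking the two questions above ("one customer generates many invoices; one invoice is generated by one customer") is what establishes the 1:M connectivity.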

(Section 2.5) The evolution of data models

Evolution of Major Data Models Table 2.1 page 35

(Section 2.5.1) The hierarchical model

Hierarchical: model developed in 1960s to manage large amounts of data for complex manufacturing

projects such as the Apollo rocket that landed on the moon in 1969. Logical structure is depicted by an

upside-down tree. Disadvantages were: too complex to implement, it was difficult to manage, and it lacked

structural independence. No standards for how to implement the model. Record based model.

Segment: equivalent of a file system's record type.

(Section 2.5.2) Network model

Network model: was created to represent complex data relationships more effectively than the hierarchical

model, to improve db performance, and to impose a db standard. Its disadvantages were limited data

independence and lack of ad hoc query capability. Record based model class.

To help establish db standards, the Conference on Data Systems Languages (CODASYL) created the

Database Task Group (DBTG) in the late 1960s. The DBTG report contained specifications for 3 crucial

db components.

Schema: It includes a definition of the db name, the record type for each record, and the

components that make up those records.

Subschema: The existence of subschema definitions allows all app programs to simply invoke the

subschema required to access the appropriate db files.

Data management language (DML): defines the environment in which data can be managed. To

produce the desired standardization for each of the three components, the DBTG specified three

distinct DML components:

o Schema Data definition language (DDL) enables the db admin to define the schema

components.

o Subschema DDL allows the app program to define the db components that will be used

by the app.

o Data manipulation language to work with the data in the db.

Network model allows a record to have more than one parent unlike the hierarchical model. In network db

terminology a relationship is called a Set and each Set is composed of at least 2 record types.

(Section 2.5.3) The relational model

Relational model: introduced in 1970 by E.F. Codd of IBM. You can think of a Relation (or Table) as a matrix composed of intersecting rows and columns. Each row in a relation is called a Tuple. They are record-based models.

Page 9: Database Design and Implementation

Relational db management system (RDBMS): It performs the same basic functions provided by the hierarchical and network DBMS systems, plus other functions that make the relational data model easier to understand and implement. Its disadvantage was that it did not allow structures to be examined graphically.

Most important advantage is its ability to hide the complexities of the relational model from

the user. It manages all the physical details; the user sees the relational db as a collection of

tables in which data are stored.

RDBMS uses SQL to translate user queries into instructions for retrieving the requested data.

There is one crucial difference between a table and a file: the table yields complete data and structural

independence because it is a purely logical structure.

Any SQL based relational db app involves 3 parts:

End-user interface: Allows the end user to interact with the data.

Tables: Hold the data; the tables are independent of each other.

SQL engine: Is told what must be done but not how it must be done. It does all the work in the background, such as executing queries or data requests.

(Section 2.5.4) The entity relationship model

ERM: Peter Chen introduced the ERM in 1976. It's the graphical representation of entities and their

relationships in a db structure. ERM is represented in ERD. The ER model is based on the following

components (the ERM belongs to the object-based class of models):

Entity: each row in the relational table is known as an entity instance or entity occurrence in the

ER model. Each entity is described by a set of attributes that describes particular characteristics of

the entity.

Relationships: they describe associations among data. Most relationships describe associations

between two entities. ER model uses the term connectivity to label the relationship types. The

name of the relationship is an active or a passive verb.

(Section 2.5.5) The Object Oriented (OO) Model

OODM: both data and their relationships are contained in a single structure known as an object. In turn,

the OODM is the basis for OODBMS. Unlike an entity, an object includes info about relationships

between the facts within the object, as well as info about its relationship with other objects. OODM is said

to be a Semantic Data Model because semantic indicates meaning. The OODM is based on the following

components.

An object is an abstraction of a real world entity.

Attributes describe the properties of an object.

Objects that share similar characteristics are grouped in classes. A class is a collection of similar

objects with shared structure (attributes) and behavior (Methods). Methods represent a real world action such as finding a selected PERSON's name, changing a PERSON's name, or printing a PERSON's address. Methods are like procedures. In OO, methods are defined as behaviors.

Classes are organized in a class hierarchy which represents an upside-down tree where each class

only has one parent.

Inheritance is the ability of an object within the class hierarchy to inherit the attributes and

methods of the classes above it.

A disadvantage is a steeper learning curve.

(Section 2.5.6) The convergence of data models

Extended relational data model (ERDM): It's semantic and is described as an object/relational db management system (O/RDBMS). It's primarily geared to business apps, while the OODM tends to focus

on very specialized engineering and scientific apps.

The traditional entity-relationship model and the most important features of object-oriented models have

been combined in the extended (or enhanced) entity-relationship model (EERM).

(Section 2.5.7) Database models and the internet

Page 10: Database Design and Implementation

(Section 2.5.8) Data Models: Summary

Advantages and disadvantages of db models are depicted on page 47.

Data model basic terminology comparison is on page 48.

(Section 2.6) Degrees of data abstraction

A db designer starts with an abstract view of the overall data environment and adds details as the design

comes closer to implementation. The design of a DB can be divided into four models with decreasing levels of abstraction: External, Conceptual, Internal, & Physical.

The American National Standards Institute (ANSI) Standards Planning and Requirements Committee

(SPARC) defined a framework for data modeling based on degrees of data abstraction. They define three

levels of data abstraction: External, Conceptual, and Internal.

(Section 2.6.1) External model

External Model: is the end user's view of the data environment.

External Schema: It's a specific representation of an external view.

External Views advantages:

It makes it easy to id specific data required to support each business unit's ops.

It makes the designer's job easy by providing feedback about the model's adequacy.

It helps to ensure security constraints in the db design.

It makes app program development much simpler.

(Section 2.6.2) The conceptual Model

Conceptual Model: represents a global view of the entire db as viewed by the entire org. That is, the

conceptual model integrates all external views (entities, relationships, constraints, and processes) into a

single global view of the entire data in the enterprise. Also known as Conceptual Schema as it is the basis

for the id and high level description of the main data objects. The most widely used conceptual model is the

ER model, which is the basic db blueprint. ERD is used to graphically represent the conceptual schema.

Advantages of Conceptual Models:

It provides a relatively easily understood bird's-eye view of the data environment

Is independent of both software and hardware.

Software independence: model does not depend on the DBMS software used to implement the model.

Hardware independence: model does not depend on the hardware used in the implementation of the

model.

Logical design: used to refer to the task of creating a conceptual data model that could be implemented in

any DBMS.

(Section 2.6.3) The internal model

Internal model: is the representation of the db as "seen" by the DBMS. It requires the designer to match the conceptual model's characteristics and constraints to those of the selected implementation model.

Internal model is software-dependent but hardware-independent because it is unaffected by the choice of

the computer on which the software is installed.

Internal schema: depicts a specific representation of an internal model, using the db constructs supported

by the chosen db.

Logical independence: When you can change the internal model without affecting the conceptual model.

(Section 2.6.4) The physical model

Physical model: operates at the lowest level of abstraction, describing the way data are saved on storage

media. It is software and hardware dependent. Physical model is dependent on the DBMS.

Page 11: Database Design and Implementation

Physical independence: when you can change the physical model without affecting the internal model.

Summary on page 52.

---------------------------------------------------------------------------------------------------------------------------------

Chapter 3: Relational DB Model

(Section 3.1) A Logical View of Data

The relational data model allows designer to focus on the logical representation of the data and its

relationships, rather than on the physical storage details, much like an automatic transmission hides the underlying mechanics. The relational db

provides the advantages of structural and data independence. The relational model was introduced by Ted

Codd of IBM Research in 1970. Record based.

(Section 3.1.1) Tables and their characteristics

1) Table is perceived as a 2 dimensional structure of rows and columns,

2) Each table row represents a single entity occurrence within the entity set,

3) Each table column represents an attribute, and each column has a distinct name,

4) Each row/column intersection represents a single data value,

5) All values in a column must conform to the same data format,

6) Each column has a specific range of values known as the attribute domain,

7) The order of the rows and columns is immaterial to the DBMS,

8) Each table must have an attribute or a combination of attributes that uniquely id each row.

Most DBMSs support the following data types.

Numeric: Anything concerned with arithmetic.

Characters: Text data or string data.

Date: Date attributes contain calendar dates stored in a special format known as the Julian date format. It

allows you to do a special kind of arithmetic known as Julian date arithmetic.

Logical: Logical data can have only a true or false (yes or no) condition.

Domain: Is the column's range of permissible values.

Primary Key: Each table must have one. The PK is an attribute (or combination of attributes) that uniquely ids any given row. It's a unique identifier. No duplicate values are allowed. The PK generally cannot be changed.
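A minimal sketch of these characteristics in a table definition (table and column names invented; exact data type names vary slightly by DBMS):

    CREATE TABLE STUDENT (
        STU_NUM    INTEGER      PRIMARY KEY,                          -- uniquely ids each row; no duplicates, no nulls
        STU_LNAME  VARCHAR(30)  NOT NULL,                             -- character (string) data
        STU_DOB    DATE,                                              -- date data
        STU_GPA    NUMERIC(3,2) CHECK (STU_GPA BETWEEN 0 AND 4),      -- numeric data with an explicit domain (0,4)
        STU_ACTIVE CHAR(1)      CHECK (STU_ACTIVE IN ('Y','N'))       -- logical (true/false) style attribute
    );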

(Section 3.2) Keys

Keys are important in a relational model because they are used to ensure that each row in a table is

uniquely identifiable. They are used to establish relationships among tables and to ensure the integrity of

the data. A key consists of one or more attributes that determine other attributes. Keys are used to id

specific occurrences of entities within an entity group.

Determination: A key's role is based upon determination. For example, if A determines B, C, and D, then A → B, C, D. For instance, STU_NUM determines STU_LNAME (STU_NUM → STU_LNAME). This principle is important because it is used in

the definition of a central relation db concept known as functional dependence.

Functional Dependence: The attribute B is functionally dependent on the attribute A if each value in

column A determines one and only one value in column B.

Composite Key: A key composed of more than one attribute.

Key Attribute: Any attribute that is part of a key.

Full Functional Dependence: If attribute (B) is functionally dependent on a composite key (A) but not on

any subset of that composite key, the attribute (B) is fully functionally dependent on (A).

Superkey: Any key that uniquely ids each row. It functionally determines all of a row's attributes.

Candidate key: Can be described as a superkey without unnecessary attributes, that is, a minimal superkey. If there were both a STU_SSN and a STU_NUM, both would be candidate keys because either one would uniquely id each student.

Entity Integrity: It's when all the rows in a table can be uniquely id by a primary key. To maintain entity

integrity, a null (that is, no data entry at all) is not permitted in the primary key.

Null: no value at all. It does not mean 0 or a space. A null is created when you press the Enter key or the Tab key to move to the next entry without making a prior entry of any kind. Nulls can never be part of a

Page 12: Database Design and Implementation

primary key and they should be avoided in other attributes. The existence of nulls in a table is often an

indication of poor db design. A null can represent an unknown attribute value, a known, but missing

attribute value, or a "not applicable" condition.

Relational Schema: Textual representation of the database tables where each table is listed by its name

followed by the list of its attributes in parentheses.

Foreign Key: An attribute or combination of attributes in one table whose values must either match the primary key values in the related table or be null. Example: VEND_CODE is the primary key in the VENDOR table and it

occurs as a FK in the PRODUCT table. You can logically relate data from multiple tables using FK. FK are

based on data values and are purely logical, not physical, pointers. A FK value must match an existing PK

value or unique key value, or else be NULL.

Referential Integrity: If the FK contains a value, that value refers to an existing valid tuple or row in

another relation. It‘s maintained between the PRODUCT and VENDOR tables. To maintain Referential

Integrity, the FK must contain values only found in the other table, or null values to indicate that the rows

are not linked.

Secondary Key: Is used strictly for data retrieval purposes. It's not the customer number, but it can be a combination of attributes such as customer phone and last name. But it's not always entirely unique. It's another way of narrowing down a search when you don't know the unique customer number.

(Section 3.3) Integrity Rules

Entity Integrity:

Requirement: All primary key entries are unique, and no part of a primary key may be null.

Purpose: each row will have a unique id, and foreign key values can properly reference primary

key values.

Example: No invoice can have a duplicate number, nor can it be null. In short, all invoices are

uniquely id by their invoice number.

Referential Integrity:

Requirement: A foreign key may have either a null entry, as long as it is not a part of its table's

primary key, or an entry that matches the primary key value in a table to which it is related. Every

non-null foreign key value must reference an existing primary key value.

Purpose: it is possible for an attribute not to have a corresponding value, but it will be impossible

to have an invalid entry. The enforcement of the referential integrity rule makes it impossible to

delete a row in one table whose primary key has mandatory matching foreign key values in

another table.

Example: A customer might not have an assigned sales rep (number), but it will be impossible to

have an invalid sales rep (number).

Flags are used to indicate the absence of some values. It's a trick to avoid using nulls.
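A minimal sketch of both integrity rules using the VENDOR/PRODUCT example above (columns other than VEND_CODE are invented, and the exact error behavior depends on the DBMS):

    CREATE TABLE VENDOR (
        VEND_CODE INTEGER PRIMARY KEY       -- entity integrity: unique and never null
    );

    CREATE TABLE PRODUCT (
        PROD_CODE INTEGER PRIMARY KEY,
        VEND_CODE INTEGER REFERENCES VENDOR (VEND_CODE)   -- FK: must match a vendor or be null
    );

    -- Referential integrity in action:
    INSERT INTO PRODUCT (PROD_CODE, VEND_CODE) VALUES (1, 999);   -- rejected if vendor 999 does not exist
    DELETE FROM VENDOR WHERE VEND_CODE = 100;                     -- rejected while products still reference vendor 100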

(Section 3.4) Relational set operators

Relational Algebra: Defines the theoretical way of manipulating table contents using the eight relational

operators: Select, Project, Join, Intersect, Union, Difference, Product, and Divide.

Closure: The use of relational algebra operators on existing tables (relations) produces new relations.

UNION: Combines all rows from two tables, excluding duplicate rows. Tables must have the same

attribute characteristics (the columns and domains must be identical) to be used in the UNION. When two or more

tables share the same number of columns, when the columns have the same names, and when they share the

same (or compatible) domains, they are said to be Union-Compatible.

INTERSECT: Yields only the rows that appear in both tables. Cannot intersect if one of the attributes is

numeric and one is character based. The tables must be union compatible.

DIFFERENCE: Yields all rows in one table that are not found in the other table. It subtracts one table

from the other. The tables must be union compatible.

PRODUCT: Yields all possible pairs of rows from two tables – also known as the Cartesian product. If

one table has 6 rows and the other table has 3 rows, the PRODUCT yields a list composed of 6 x 3 = 18 rows.

SELECT: AKA RESTRICT, yields values for all rows found in a table that satisfy a given condition. It's

used to list all the row values, or it can yield only those row values that match a specified criterion.

PROJECT: Yields all values for selected attributes. It yields a vertical subset of a table.

Page 13: Database Design and Implementation

JOIN: It allows info to be combined from 2 or more tables. It's the real power behind the relational db,

allowing the use of independent tables linked by common attributes.

Natural Join: Links tables by selecting only the rows with common values in their common attributes. It's the result of a three-stage process (PRODUCT, SELECT, and PROJECT).

Join Columns: or common columns.

Equijoin: another form of join that links tables on the basis of an equality condition that compares

specified columns of each table. Equijoin takes its name from the equality comparison operator (=) used in

the condition. If any other comparison operator is used, the join is called a Theta Join. Theta joins are less common than equijoins, and they represent inequalities.

Outer Join: The matched pairs would be retained and any unmatched values in the other table would be

left null. If an outer join is produced, 2 scenarios are possible:

Left Outer Join: Yields all of the rows in the CUSTOMER table, including those that do not have

a matching value in the AGENT table.

Right Outer Join: Yields all of the rows in the AGENT table, including those that do not have

matching values in the CUSTOMER table.

DIVIDE: uses one single-column table as the divisor and one 2-column table as the dividend. The tables must have a common column.
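Most of these operators map directly onto SQL. A minimal sketch (TABLE_A/TABLE_B and the CUSTOMER/AGENT columns are invented; DIFFERENCE is spelled EXCEPT in standard SQL and MINUS in Oracle):

    SELECT CUS_LNAME FROM TABLE_A
    UNION                                   -- all rows from both tables, duplicates removed
    SELECT CUS_LNAME FROM TABLE_B;

    SELECT CUS_LNAME FROM TABLE_A
    INTERSECT                               -- only rows appearing in both tables
    SELECT CUS_LNAME FROM TABLE_B;

    SELECT CUS_LNAME FROM TABLE_A
    EXCEPT                                  -- DIFFERENCE: rows in TABLE_A not found in TABLE_B
    SELECT CUS_LNAME FROM TABLE_B;

    SELECT C.CUS_LNAME, A.AGENT_PHONE       -- PROJECT: pick columns
    FROM CUSTOMER C
    JOIN AGENT A ON C.AGENT_CODE = A.AGENT_CODE   -- equijoin on the common attribute
    WHERE C.CUS_STATE = 'FL';               -- SELECT/RESTRICT: pick rows

    SELECT C.CUS_LNAME, A.AGENT_PHONE
    FROM CUSTOMER C
    LEFT OUTER JOIN AGENT A ON C.AGENT_CODE = A.AGENT_CODE;   -- keeps customers with no matching agent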

(Section 3.5) The data dictionary and the system catalog

Data Dictionary: Provides a detailed description of all tables found within the user/designer-created db. It

contains all attribute names and characteristics for each table. It contains metadata. Sometimes referred to

as "the db designer's db".

System Catalog: It contains metadata. It‘s a detailed system data dictionary that describes all objects

within the db. It contains more info than the data dictionary. Created by the dbms.

Homonyms: Similar or identically sounding words with different meanings, such as boar and bore. In db design it indicates the use of the same attribute name to label different attributes. This should be avoided.

Synonym: Is the use of different names to describe the same attribute. For example a car and auto refer to

the same thing. This should also be avoided in db design.
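Querying the catalog is how you read this metadata in practice. A minimal sketch (many DBMSs such as MySQL, PostgreSQL, and SQL Server expose INFORMATION_SCHEMA views like these; Oracle uses USER_TABLES and USER_TAB_COLUMNS instead):

    -- Tables visible to the current user.
    SELECT TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES;

    -- Attribute names, data types, and nullability for one table.
    SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'PRODUCT';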

(Section 3.6) Relationships within the Relational db

1:M is the relational modeling ideal

1:1 should be rare in any relational db design.

M:N cannot be implemented as such in the relational model.

(Section 3.6.1) 1:M Relationship Page 80 review

(Section 3.6.2) 1:1 Relationship Page 82 review

1:1 sometimes means that the entity components were not defined properly. It could indicate that two

entities belong in the same table. They should be rare but certain conditions require their use. There is a

great example of a 1:1 relationship regarding employees on page 84.

(Section 3.6.3) M:N Relationship Page 84 review

M:N is not supported directly in the relational environment. However, M:N relationships can be

implemented by creating a new entity in 1:M relationships with the original entities. The problems inherent

in M:N relationships can be avoided by creating a Composite entity (or bridge entity or associative entity).

A Linking Table is the implementation of a composite entity.
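A minimal sketch of a linking table for the earlier student/class M:N example (all names invented); the single M:N relationship becomes two 1:M relationships through the composite entity:

    CREATE TABLE STUDENT (
        STU_NUM INTEGER PRIMARY KEY
    );

    CREATE TABLE CLASS (
        CLASS_CODE INTEGER PRIMARY KEY
    );

    CREATE TABLE ENROLL (                               -- composite (bridge) entity
        STU_NUM      INTEGER REFERENCES STUDENT (STU_NUM),
        CLASS_CODE   INTEGER REFERENCES CLASS (CLASS_CODE),
        ENROLL_GRADE CHAR(1),
        PRIMARY KEY (STU_NUM, CLASS_CODE)               -- composite PK drawn from both parents
    );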

(Section 3.7) Data Redundancy Revisited

Relational dbs make it possible to control data redundancies by using common attributes that are shared by

tables, called foreign keys. Although the use of FK does not totally eliminate data redundancies because the

FK values can be repeated many times, the proper use of FK minimizes data redundancies, thus minimizing

the chance that destructive data anomalies will develop.

Page 14: Database Design and Implementation

Db designers must reconcile three often contradictory requirements: Design elegance, Processing speed,

and Info requirements.

As important as data redundancy control is, there are times when the level of data redundancy must actually

be increased to make the db serve crucial info purposes. Illustrated on page 89.

(Section 3.8) Indexes

Index: An orderly arrangement used to logically access rows in a table. It is composed of an index key and

a set of pointers. An index is an ordered arrangement of keys and pointers; each key points to the location of

the data identified by the key. Indexes play an important role in DBMSs for the implementation of primary

keys. When you define a table's primary key, the DBMS automatically creates a unique index on the

primary key column you declared. A table can have many indexes, but each index is associated with only

one table. If a table is dropped, so are any indexes that were created for it.

Index Key: Is the index's reference point. It can also be composed of one or more attributes.

Unique Index: is an index in which the index key can have only one pointer value (row) associated with it.

They are used to enforce uniqueness constraints.
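A minimal sketch of index creation (names invented; as noted above, the DBMS builds the index on a declared PK automatically):

    CREATE INDEX CUS_LNAME_NDX ON CUSTOMER (CUS_LNAME);        -- speeds up searches on last name
    CREATE UNIQUE INDEX CUS_EMAIL_NDX ON CUSTOMER (CUS_EMAIL); -- also enforces a uniqueness constraint
    DROP TABLE CUSTOMER;                                       -- dropping the table drops its indexes as well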

(Section 3.9) CODD’S Relational db rules

In 1985, Dr. E.F. Codd published a list of 12 rules to define a relational db system. The 12 rules are located

on page 92. Not all db vendors fully support all 12 rules.

---------------------------------------------------------------------------------------------------------------------------------

Chapter 4: Entity Relationship (ER) Modeling

(Section 4.1) Entity Relationship Model (ERM)

Conceptual models are used in the conceptual design of dbs, while relational models are used in the logical design of dbs. The ERM is a conceptual model and belongs to the object-based class of models.

ERD should be developed prior to building a db & it depicts the db's main components such as entities, attributes, relationships, and relationship types. There are various notations used with ERDs, such as Chen's notation, Crow's Foot, and UML notation.

Chen's notation favors conceptual modeling.

The Crow's Foot notation favors a more implementation-oriented approach.

The UML notation can be used for both conceptual and implementation modeling.

What role does the ER diagram play in the design process?

The ER diagram must reflect an organization's operations accurately if the database is to meet that

organization's data requirements. The completed ER diagram forms the basis for design review processes

that verify whether the included entities are appropriate and sufficient, whether the attributes found within

those entities are needed and correct, and whether the relationships between those entities are needed and

correctly represented. The ER diagram is also used as a final crosscheck against the proposed data

dictionary entries. The ER diagram helps the database designer communicate more precisely with those

who most completely understand the business data requirements. Finally, the completed ER diagram serves

as the implementation guide to those who create the actual database. Many ERD software tools can

generate the SQL statements to produce the tables represented in the ERD. In short, the ER diagrams are as

important to the database designer as blueprints are to architects and builders.

Why is data modeling so important to the database designer?

We are said to live in the information age, and data constitute the most basic information units employed by

an information system. Data modeling provides a way to reconcile the very different end-user views of the

nature and roles of data.

A data model is an abstraction that provides an easily understood representation of complex real-world data

structures. A data model helps us understand, communicate and document the complexities of a real-world

Page 15: Database Design and Implementation

data environment. Such understanding yields useful solutions to the problems inherent in creating,

organizing, using, and managing data.

If a database is to be useful and flexible, it must be well designed. The database design process must be

based on an appropriate data model if it is to yield a proper database design blueprint.

(Section 4.1.1) Entities

The word entity in the ERM corresponds to a table, not to a row, in the relational environment.

The ERM refers to a table row as an entity instance or entity occurrence. Entity name is a noun with the

shape of a rectangle written in cap letters. It's a person, place, thing, or shared idea about which data are

collected and stored. They are the basic building blocks of a relational db.

Entity Sets: Entities that are often grouped according to common attributes. They are stored in tables.

(Section 4.1.2) Attributes

Attributes are characteristics that describe entities. Example STUDENT entity includes STU_LNAME

STU_FNAME and so on. They are oval shaped in Chen model. They are columns that represent a

characteristic of the entity.

Required and Optional Attributes:

Required Attributes: is an attribute that must have a value. It cannot be left empty.

Optional Attribute: is an attribute that does not require a value, therefore it can be left empty.

Domains: Attributes have a domain. A domain is the set of possible values for a given attribute. The

domain for the GPA attribute is written (0,4) because the lowest possible GPA value is 0 and the highest is

4. Domain for male and female is M or F. Attributes may share a domain.

Identifiers: One or more attributes that uniquely id each entity instance. They are underlined in ERD and

are mapped to PK.

Composite Identifiers: A PK composed of more than one attribute.

Composite and simple attributes: Attributes are classified as simple or composite.

Composite attribute: An attribute that can be further sub-divided to yield additional attributes.

Example ADDRESS can be sub-divided into Street, City, & Zip.

Simple Attribute: It cannot be sub-divided. Example Age, Sex, and Marital Status.

Single Valued Attributes: Is an attribute that can have only a single value. Example a person can only

have one SSN. It's not always a simple attribute.

Multivalued Attributes: They can have many values. Example is a person can have several college

degrees. They are shown with a double line in the Chen notation and are not identified in the Crow's Foot notation. Although the

conceptual model can handle M:N relationships and multivalued attributes, you should not implement them

in the RDBMS.

Implementing Multivalued Attributes: They should not be implemented in an RDBMS. There are two possible courses of action with multivalued attributes in a relational table:

Splitting the multivalued attribute into new attributes

Creating a new entity composed of the original multivalued attribute components. This is the preferred approach.

Derived Attributes (Computed Attributes): Attribute whose value is calculated (derived) from other

attributes. It's shown with a dashed line in Chen notation. It's derived by using algorithms, for example INT((DATE() - EMP_DOB)/365) to calculate the age of the person. Advantages & disadvantages of storing derived

attributes:

Advantages of storing the derived attribute: Saves CPU processing cycles, saves data access time, data value is readily available, and can be used to keep track of historical data.

Disadvantages of storing: Requires constant maintenance to ensure the derived value is current, especially if any values used in the calculation change.

Advantages of not storing: Saves storage space, and the computation always yields the current value.

Disadvantages of not storing: Uses CPU processing cycles, increases data access time, and adds coding complexity to queries.
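A minimal sketch of the "not stored" choice (EMPLOYEE column names are invented; date arithmetic syntax differs across DBMSs, so this just mirrors the INT((DATE() - EMP_DOB)/365) idea above):

    -- Derived attribute computed at query time rather than stored in the table.
    SELECT EMP_NUM,
           EMP_DOB,
           FLOOR((CURRENT_DATE - EMP_DOB) / 365) AS EMP_AGE
    FROM EMPLOYEE;

Wrapping such a query in a view keeps the derived value readily available without paying the maintenance cost of storing it.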

Page 16: Database Design and Implementation

(Section 4.1.3) Relationships

A relationship is an association between entities. The entities that participate in a relationship are also

known as participants. A relationship's name is an active or passive verb; for example, a STUDENT takes a

CLASS. They operate in both directions.

(Section 4.1.4) Connectivity and Cardinality

Connectivity: It's used to describe the relationship classification.

Cardinality: It expresses the min and max number of entity occurrences associated with one occurrence of

the related entity. An ERD depicts it as (1,4), with 1 being the MIN and 4 being the MAX.

Connectivities and cardinalities are generally based on business rules and must consider the data

environment, transactions, and information requirements.

(Section 4.1.5) Existence Dependence

Existence-Dependent: An entity is existence-dependent if it can exist only when it is associated with another related entity occurrence. Example: EMPLOYEE claims DEPENDENT. The entity DEPENDENT is clearly existence-dependent on the EMPLOYEE entity because it is impossible for the dependent to exist apart from the EMPLOYEE in the db.

Existence-Independent (strong or regular): If an entity can exist apart from one or more related entities.

(Section 4.1.6) Relationship Strength

Entities that are existence-independent of another entity are said to have weak or non-identifying relationships. The concept of relationship strength is based on how the PK of a related entity is defined.

Weak or (Non-identifying) Relationships: Exist if the PK of the related entity does not contain a PK component of the parent entity. Example: COURSE (CRS_CODE, DEPT_CODE), CLASS (CLASS_CODE, CRS_CODE). Shown as a dashed line in Crow's Foot notation.

Strong or identifying relationships exist when the entities are existence-dependent.

Strong or (Identifying) Relationships: Exist when the PK of the related entity contains a PK component of the parent entity. Example: COURSE (CRS_CODE, DEPT_CODE), CLASS (CRS_CODE, CLASS_SECTION). Shown as a solid line in Crow's Foot notation.

(Section 4.1.7) Weak Entities

Weak Entity is one that meets two conditions: the entity is existence-dependent, and the entity has a primary key that is partially or totally derived from the parent entity in the relationship. EMPLOYEE has DEPENDENT; DEPENDENT is weak because it cannot exist without EMPLOYEE. Its shape is a double rectangle. A weak entity inherits part of its PK from its strong counterpart: EMPLOYEE (EMP_NUM), DEPENDENT (EMP_NUM, DEP_NUM).
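A minimal sketch of the weak entity (extra columns invented); the identifying relationship shows up as a composite PK that inherits EMP_NUM from the strong EMPLOYEE entity:

    CREATE TABLE EMPLOYEE (
        EMP_NUM INTEGER PRIMARY KEY
    );

    CREATE TABLE DEPENDENT (
        EMP_NUM   INTEGER REFERENCES EMPLOYEE (EMP_NUM),  -- existence-dependent on EMPLOYEE
        DEP_NUM   INTEGER,
        DEP_FNAME VARCHAR(30),
        PRIMARY KEY (EMP_NUM, DEP_NUM)                    -- PK partially derived from the parent's PK
    );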

(Section 4.1.8) Relationship Participation

It's either optional or mandatory. Optional Participation means that one entity occurrence does not require a corresponding entity occurrence in a particular relationship. Each entity is implemented as a table.

In "COURSE generates CLASS", at least some courses do not generate a class. In other words, an entity occurrence (row) in the COURSE table does not necessarily require the existence of a corresponding entity occurrence in the CLASS table. Therefore the CLASS entity is considered to be optional to the COURSE entity. It's shown as the O symbol on the relationship line in a Crow's Foot diagram.

Mandatory Participation: Means that one entity occurrence requires a corresponding entity occurrence in

a particular relationship. It indicates that the min cardinality is 1 for the mandatory entity. Mandatory on

the "1" side and optional on the "Many" side. Crow's Foot symbols on page 120.

(Section 4.1.9) Relationship Degree

Relationship Degree indicates the number of entities or participants associated with a relationship.

Unary Relationship exists when an association is maintained within a single entity.

An employee within the EMPLOYEE entity is the manager for one or more employees within that

entity. In this case the existence of the "managers" relationship means that EMPLOYEE requires

Page 17: Database Design and Implementation

another EMPLOYEE to be the manager; that is, EMPLOYEE has a relationship with itself. Such a relationship is known as a Recursive Relationship.

Binary Relationship exists when two entities are associated in a relationship. Most common.

Ternary relationship exists when three entities are associated.

(Section 4.1.10) Recursive Relationships

Recursive Relationship is one in which a relationship can exist between occurrences of the same entity

set. Naturally it's found in the Unary Relationship.

(Section 4.1.11) Associative (Composite or Bridge) Entities

The relational model generally requires the use of 1:M relationships. (Also, recall that the 1:1 relationship has

its place, but it should be used with caution and proper justification.) If M:N relationships are encountered,

you must create a bridge between the entities that display such relationships. The Associative Entities are

used to implement an M:N relationship between two or more entities.

(Section 4.2) Developing an ER diagram

Iterative process: Repetition of processes and procedures. The business rules define the ERD components.

Building an ERD involves the following:

Create a detailed narrative of the organization's description of operations.

Id the business rules based on the description of the operations.

Id the main entities and relations from the business rules.

Develop the initial ERD

Id the attributes and PK that adequately describe the entities.

Revise and review the ERD.

(Section 4.3) DB design challenges: Conflicting goals

DB designers often must make design compromises that are triggered by conflicting goals, such as adherence to design standards or elegance, processing speed, and info requirements.

Design Standards: Guide you in developing logical structures that minimize data redundancies, avoid nulls to the greatest extent possible, and allow you to work with well-defined components and to evaluate the interaction of those components with some precision.

Processing Speed: The top priority for large numbers of transactions; it means minimal access time, which may be achieved by minimizing the number and complexity of logically desirable relationships. A perfect design may use 1:1 relationships to avoid nulls, while a higher-transaction-speed design might combine the two tables to avoid the use of an additional relationship, using dummy entries to avoid nulls.

Information Requirements: Complex info requirements may dictate data transformations, and they

may expand the number of entities and attributes within the design. Therefore, the db may have to

sacrifice some of its "clean" design structures and/or some of its high transaction speed to ensure

max info generation.

Business rules are an important element of database design in every organization. Business rules drive all

business processes, and an organization's business rules must be correctly implemented by the

organization's IT systems, including the databases.

Business rules are precise statements, derived from a detailed description of the organization's operations.

When written properly, business rules define one or more of the following modeling components:

entities

relationships

attributes

connectivities

cardinalities

constraints

Page 18: Database Design and Implementation

Because the business rules form the basis of the data-modeling process, precisely phrasing them is crucial

to the success of the database design. Because the business rules are derived from a precise description of

operations, much of the design's success depends on the accuracy of the description of operations.

Examples of business rules are:

An invoice contains one or more invoice lines.

Each invoice line is associated with a single invoice.

A store employs many employees.

Each employee is employed by only one store.

A college has many departments.

Each department belongs to a single college. (This business rule reflects a university that has

multiple colleges such as Business, Liberal Arts, Education, Engineering, etc.)

A driver may be assigned to drive many different vehicles.

Each vehicle can be driven by many drivers. (Note: Keep in mind that this business rule reflects

the assignment of drivers during some period of time.)

A client may sign many contracts.

A sales representative may write many sales contracts.

Each sales contract is written by one sales representative.

Each sale involves a sales representative, a customer, and one or more products.

Note that each relationship definition requires the definition of two business rules. For example, the

relationship between the INVOICE and (invoice) LINE entities is defined by the first two business rules in

the bulleted list. This two-way requirement exists because there is always a two-way relationship between

any two related entities. (This two-way relationship description also reflects the implementation by many of

the available CASE tools.) The last business rule above describes a three-way sale relationship between

sales representatives, customers, and products.

---------------------------------------------------------------------------------------------------------------------------------

Chapter 5: Normalization of DB tables

(Section 5.1) DB tables and Normalization

Normalization: The answer to how to recognize a poor table structure and how to produce a good table. It's a process of evaluating and correcting table structures to minimize data redundancies, thereby reducing the likelihood of data anomalies. It involves assigning attributes to tables based on the concept of determination. It is a sequence of tests that are applied to candidate entities and their attributes. It works through a series of stages called Normal Forms: First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). 3NF is the highest of these three. 3NF is not always the way to go because it can hurt performance.

Excessive normalization can result in less easily understood entities and slow processing speed. Denormalization produces a lower normal form; that is, a 3NF table will be converted to a 2NF table through denormalization. However, the price you pay for increased performance through denormalization is greater data redundancy.

(Section 5.2) The Need for Normalization

Normalization decreases anomalies and eliminates data redundancies, and it is critical to a successful db design. The goal of normalization is to create a table such that all non-key attributes are dependent on the PK and nothing but the PK.

(Section 5.3) The Normalization Process

The objective of normalization is to ensure that each table conforms to the concept of well-formed relations, that is, tables that have the following characteristics.

Each table represents a single subject. For example, a student table will contain only student data.


No data item will be unnecessarily stored in more than one table. This is to ensure data is updated

in only one place.

All nonprime attributes in a table are dependent on the PK. This is to ensure that the data are

uniquely identifiable by a primary key value.

Each table is void of insertion, update, or deletion anomalies. This is to ensure the integrity and

consistency of the data.

First Normal Form (1NF): Table format, all key attributes defined with no repeating groups, and PK identified. All remaining attributes are dependent on the PK. It may still contain partial dependencies, that is, dependencies based on only part of the PK. Removing all repeating groups means that each row in a table must define only a single entity; to do this, the appropriate entry must be added to the PK column.

Second Normal Form (2NF): It is in 1NF and includes no partial dependencies. It may contain transitive dependencies, which are based on attributes that are not part of the PK. The table can be put into 2NF by ensuring no attribute is dependent on only part of the primary key. If such a partial dependency exists, a new table can be created with a primary key equal to the required portion of the original key, and the dependent attributes are moved to this table. If the 2NF table has any transitive dependencies, the dependencies can be eliminated by breaking them off and storing them in a separate table.

Third Normal Form (3NF): It is in 2NF and includes no transitive dependencies.

Boyce-Codd Normal Form (BCNF): Every determinant is a candidate key (special case of 3NF). If a 3NF table has only a single candidate key, it is automatically in BCNF. BCNF can be violated only if the table contains more than one candidate key.

Fourth Normal Form (4NF): It‘s in 3NF or BCNF and no independent multi-valued

dependencies. Splitting the table to remove all multi-valued dependencies.

(5NF) & (DKNF) are not likely to be encountered in a business environment and are mainly of

theoretical interest.

The normalizations process works one relation at a time. It starts by identifying the dependencies

of a given relation and progressively breaking up the relation into a set of new relations based on

the identified dependencies.

Update Anomaly: Occurs when you modify duplicate data in the system. It runs the risk of the data not being properly modified in every place it is stored.

Insert Anomalies: You can't insert data due to missing info (especially a missing key!)

Delete Anomaly: You can't delete data without deleting other essential data (data that you don't want to delete).

Functional Dependency: Before outlining normalization process, it is good to review the concepts of

determination and functional dependency. Check table 5.3.

(Section 5.3.1) Conversion to First Normal Form (1NF)

Repeating groups: Derives its name from the fact that a group of multiple entries of the same type can

exist for any single key attribute occurrence.

The normalization process starts with a simple three-step procedure.

Step 1: Eliminate the Repeating Groups: Eliminate the nulls by making sure that each repeating

group attribute contains an appropriate data value. That change converts the table to 1NF.

Step 2: Identify the PK:

Step 3: Identify all Dependencies. A dependency diagram is helpful in getting a bird's-eye view of all of the relationships among a table's attributes, and its use makes it less likely that you will overlook an important dependency.

Partial Dependency: A dependency based on only a part of a composite primary key that determines other attributes. Partial dependencies are sometimes tolerated for performance reasons, but they should be used with caution, because a table that contains partial dependencies is still subject to data redundancies and to various anomalies.

Transitive Dependency: A dependency of one nonprime attribute on another nonprime attribute.

The problem is that they still yield data anomalies. It exists because a nonkey attribute determines the

values of another nonkey attribute.


(Section 5.3.2) Conversion to Second Normal Form (2NF)

Converting to 2NF is done only when the 1NF has a composite PK. If the 1NF has a single attribute PK,

then the table is automatically in 2NF.

Prime or Key Attribute: Any attribute that is at least part of a key.

Nonprime or Nonkey attribute: is not part of any key.

(Section 5.3.3) Conversion to Third Normal Form (3NF)

Step 1: Identify each new Determinant. A determinant is any attribute whose value determines other values within a row. If there are three different transitive dependencies, you will have three different determinants.

Step 2: Identify the Dependent Attributes.

Step 3: Remove the Dependent Attributes from the transitive dependencies. (See the sketch below.)
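As a concrete illustration of these conversion steps, the following is a minimal, hedged sketch in Oracle-style SQL. The table and column names (ASSIGNMENT, PROJECT, JOB, EMPLOYEE, PROJ_NUM, EMP_NUM, JOB_CLASS, CHG_HOUR, and so on) are illustrative assumptions, not the text's exact example: a 1NF table with a partial and a transitive dependency is decomposed into 2NF and then 3NF tables.

    -- Hypothetical 1NF table: composite PK is (PROJ_NUM, EMP_NUM).
    -- PROJ_NAME depends only on PROJ_NUM               -> partial dependency
    -- JOB_CLASS determines CHG_HOUR (nonkey -> nonkey) -> transitive dependency
    CREATE TABLE ASSIGNMENT_1NF (
        PROJ_NUM   NUMBER,
        PROJ_NAME  VARCHAR2(50),
        EMP_NUM    NUMBER,
        EMP_NAME   VARCHAR2(50),
        JOB_CLASS  VARCHAR2(30),
        CHG_HOUR   NUMBER(7,2),
        HOURS      NUMBER(5,1),
        PRIMARY KEY (PROJ_NUM, EMP_NUM)
    );

    -- 2NF: remove the partial dependency by moving PROJ_NAME into PROJECT.
    CREATE TABLE PROJECT (
        PROJ_NUM   NUMBER PRIMARY KEY,
        PROJ_NAME  VARCHAR2(50)
    );

    -- 3NF: remove the transitive dependency by moving CHG_HOUR into JOB.
    CREATE TABLE JOB (
        JOB_CLASS     VARCHAR2(30) PRIMARY KEY,
        JOB_CHG_HOUR  NUMBER(7,2)
    );

    CREATE TABLE EMPLOYEE (
        EMP_NUM    NUMBER PRIMARY KEY,
        EMP_NAME   VARCHAR2(50),
        JOB_CLASS  VARCHAR2(30) REFERENCES JOB
    );

    CREATE TABLE ASSIGNMENT (
        PROJ_NUM   NUMBER REFERENCES PROJECT,
        EMP_NUM    NUMBER REFERENCES EMPLOYEE,
        HOURS      NUMBER(5,1),
        PRIMARY KEY (PROJ_NUM, EMP_NUM)
    );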

(Section 5.4) Improving the design

Evaluate PK Assignments:

Surrogate Key: An artificial PK introduced by the designer with the purpose of simplifying the assignment

of PK to tables. They are usually numeric, and often automatically generated by the DBMS.

Evaluate Naming Conventions: It is better to change the attribute name to reflect the table name. If the table is JOB, the attribute should be renamed from CHG_HOUR to JOB_CHG_HOUR.

Refine Attribute Atomicity: Atomic Attribute is one that cannot be further subdivided or decomposed.

By improving the degree of atomicity, you also gain querying flexibility.

Identify new Attributes:

Identify New Relationships:

Refine PK as required for Data Granularity:

Granularity refers to the level of detail represented by the values stored in a table's row. Data stored at

their lowest level of granularity are said to be atomic data.

Maintain Historical Accuracy:

Evaluate Using Derived Attributes: The availability of the derived attribute will save reporting time.

(Section 5.5) Surrogate key considerations

At the implementation level a surrogate key is a system defined attribute generally created and managed via

the DBMS.

(Section 5.6) Higher Level Normal Forms

Tables in 3NF will perform suitably in business transactional databases.

(Section 5.6.1) The Boyce-Codd Normal Form (BCNF)

This is a special case of 3NF. A table is in BCNF when every determinant in the table is a candidate key. BCNF can be violated only when the table contains more than one candidate key. When a nonkey attribute is the determinant of a key attribute, the condition does not violate 3NF, yet it fails to meet the BCNF requirements, because BCNF requires that every determinant in the table be a candidate key.

(Section 5.6.2) Fourth Normal Form (4NF)

All attributes must be dependent on the PK, but they must be independent of each other.

No row may contain two or more multivalued facts about an entity.

A table is in 4NF when it is in 3NF and has no multiple sets of multivalued dependencies.

(Section 5.7) Normalizations and DB design

ER modeling and Normalization are difficult to separate and the two are used in an iterative and

incremental process. ER diagram looks at the "big picture" and normalization provides a "micro" view of

individual entities.


Normalization takes place in tandem with data modeling. The proper procedure is to follow these

steps:

1) Create a description of operations at an appropriate level of detail.

2) Derive appropriate business rules from the description of operations.

3) Model the data with the help of a tool such as Visio's Crow's Foot option to produce an initial

ERD. This ERD is the initial database blueprint.

4) Use the normalization procedures to identify and remove data redundancies. This process

may produce additional entities.

5) Revise the ERD created in step 3.

6) Use the normalization procedures to audit the revised ERD. If significant additional data

redundancies are discovered, repeat steps 4 and 5.

(Section 5.8) Denormalization

It is important to remember that the optimal relational db implementation requires that all tables be at least

in 3NF. The problem with normalization is that as tables are decomposed to conform to normalization

requirements, the number of db tables expands. Therefore, in order to generate info, data must be put

together from various tables. Joining a large number of tables takes additional input/output (I/O) operations

and processing logic, thereby reducing system speed. Rare and occasional circumstances may allow some

degree of denormalization so processing speed can be increased. The problem with denormalized relations

and redundant data is that the data integrity could be compromised due to the possibility of data anomalies

(insert, update, and deletion anomalies).

Unnormalized tables in a production db tend to suffer from the following defects.

Data updates are less efficient because programs that read and update tables must deal with larger

tables.

Indexing is more cumbersome. It simply is not practical to build all of the indexes required for the

many attributes that might be located in a single unnormalized table.

Unnormalized tables yield no simple strategies for creating virtual tables known as views.

--------------------------------------------------------------------------------------------------------------

Chapter 6: Advanced Data Modeling

(Section 6.1) The EERM

EERM: sometimes referred to as the Enhanced ERM is the result of adding more semantic constructs

(entity supertypes, entity subtypes, and entity clustering) to the original ERM. Entity-relationship modeling

is missing the ability to represent relationships based on specialization and generalization. For example,

you can't directly represent that students and faculty are people in an ERD. This shortcoming is addressed

in Extended Entity-Relationship Modeling, which includes specialization-generalization relationships.

Abstractions, Entities & Classes:

Abstraction means identifying the common characteristics of things and using those common

characteristics to classify or organize things. We used abstraction when we identified entities and produced

Entity-Relationship models.

(Section 6.1.1) Entity Supertypes and Subtypes

Employee = supertype; Pilot = subtype, because not all employees have the attributes of pilots. This is to prevent nulls.

Entity Supertype is a generic entity type that is related to one or more entity subtypes.

Entity Subtype: the entity supertype contains the common characteristics, and the entity subtypes contain the unique characteristics of each entity subtype.
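At the implementation level, a supertype/subtype pair is commonly realized as two tables that share the same primary key. The following is a minimal, hedged sketch (table and column names such as EMP_NUM, EMP_TYPE, and PIL_LICENSE are assumed for illustration); the EMP_TYPE column anticipates the subtype discriminator discussed below.

    CREATE TABLE EMPLOYEE (
        EMP_NUM   NUMBER PRIMARY KEY,
        EMP_NAME  VARCHAR2(50),
        EMP_TYPE  CHAR(1)      -- subtype discriminator, e.g. 'P' for pilot
    );

    -- The subtype shares the supertype's PK; that PK is also a FK to the supertype.
    CREATE TABLE PILOT (
        EMP_NUM      NUMBER PRIMARY KEY REFERENCES EMPLOYEE,
        PIL_LICENSE  VARCHAR2(25),
        PIL_RATINGS  VARCHAR2(50)
    );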

(Section 6.1.2) Specialization Hierarchy

Specialization Hierarchy: A hierarchy that is based on the top-down process of identifying lower-level,

more specific entity subtypes from a higher-level entity supertype. Specialization is based on grouping


unique characteristics and relationships of the subtypes. The relationships depicted within the specialization hierarchy are sometimes described in terms of "is-a" relationships, for example, "a pilot is an employee". Within a specialization hierarchy, a subtype can exist only within the context of a supertype, and every subtype can have only one supertype to which it is directly related.

A specialization hierarchy provides the means to:

Support attribute inheritance.

Define a special supertype attribute known as the subtype discriminator.

Define disjoint/overlapping constraints and complete/partial constraints.

In specialization Hierarchies with multiple levels of supertype/subtypes, a lower-level subtype inherits all

of the attributes and relationships from all of its upper-level supertypes.

(Section 6.1.3) Inheritance

Inheritance: enables an entity subtype to inherit the attributes and relationships of the supertype.

One important Inheritance characteristic is that all entity subtypes inherit their primary key attribute from

their supertype.

(Section 6.1.4) Subtype Discriminator

Subtype Discriminator is the attribute in the supertype entity that determines to which subtype the

supertype occurrence is related. The EMP_TYPE attribute is the subtype discriminator because it‘s the

attribute in the supertype that determines to which subtype the supertype occurrence is related.

Note that the default comparison condition for the subtype discriminator attribute is the equality

comparison. However there are situations in which the subtype discriminator is not necessarily based on an

equality comparison.

(Section 6.1.5) Disjoint and overlapping constraints

Disjoint Subtypes or non-overlapping subtypes are subtypes that contain a unique subset of the supertype entity set; in other words, each entity instance of the supertype can appear in only one of the subtypes.

Overlapping Subtypes are subtypes that contain nonunique subsets of the supertype entity set; that is, each

entity instance for the supertype may appear in more than one subtype. For example a person can be an

employee, student, or both. In turn, an employee may be a professor as well as an administrator. Because an employee also may be a student, student and employee are overlapping subtypes of the supertype person, just as professor and admin are overlapping subtypes of the supertype employee.

(Section 6.1.6) Completeness Constraint

Completeness Constraint specifies whether each entity supertype occurrence must also be a member of at

least one subtype. It can be partial or total.

Partial Completeness: (symbolized by a circle over a single line) means that not every supertype occurrence is a member of a subtype; that is, there may be some supertype occurrences that are not members of any subtype.

Total Completeness: (symbolized by a circle over a double line) means that every supertype occurrence

must be a member of at least one subtype.

(Section 6.1.7) Specialization and Generalization:

Specialization: is the top-down process of identifying lower-level, more specific entity subtypes from a higher-level entity supertype. Specialization is based on grouping unique characteristics and relationships of the subtypes. For example, we used specialization to identify multiple entity subtypes from the original employee supertype.

Generalization: is the bottom-up process of identifying a higher-level, more generic entity supertype from lower-level entity subtypes. Generalization is based on grouping common characteristics and relationships of the subtypes. For example, you might identify multiple types of musical instruments (piano, violin, and guitar) and generalize their common characteristics into a higher-level instrument supertype.

(Section 6.2) Entity Clustering

Entity Cluster: a "virtual" entity type used to represent multiple entities and relationships in the ERD. Entity clustering is a technique used to hide potentially confusing detail in an ERD. An entity cluster is formed by combining multiple interrelated entities into a single abstract entity object. It is considered virtual or abstract in the sense that it is not actually an entity in the final ERD. When using entity clusters, the key attributes of the combined entities are no longer available. Avoid the display of attributes when entity


clusters are used to prevent problems such as changes in relationships from identifying to non-identifying

or vice versa and the loss of FK attributes from some entities.

(Section 6.3) Entity Integrity: Selecting PK

The importance of properly selecting the PK has a direct bearing on the efficiency and effectiveness of db implementation.

(Section 6.3.1) Natural Keys and PK

Natural Key or Natural Identifier is a real-world, generally accepted identifier used to distinguish (that is, uniquely identify) real-world objects.

(Section 6.3.2) PK Guidelines

The function of the PK is to guarantee entity integrity, not to describe an entity.

PK and FK are used to implement relationships among entities.

Desirable primary key characteristics are UNIQUE VALUES, NONINTELLIGENT, NO CHANGE OVER TIME, PREFERABLY SINGLE-ATTRIBUTE, PREFERABLY NUMERIC, SECURITY COMPLIANT, MANIFESTNESS, IMMUTABILITY, and COMPACTNESS.

(Section 6.3.3) When to use composite PK

Composite PKs are useful in two cases.

As identifiers of composite entities, where each PK combination is allowed only once in the M:N

relationship.

As identifiers of weak entities, where the weak entity has a strong identifying relationship with the

parent entity.

The ENROLL entity mainly represents the many-to-many relationship between students and classes. Such

entities are termed association entities, bridge entities, or composite entities. Note that the table has foreign

keys to both STUDENT and CLASS, and that the primary key is the composite of those two foreign keys.
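A hedged DDL sketch of such a composite (bridge) entity follows; it assumes STUDENT and CLASS tables keyed by STU_NUM and CLASS_CODE, and the ENROLL_GRADE attribute is illustrative. The composite PK guarantees that each student/class combination appears only once.

    CREATE TABLE ENROLL (
        STU_NUM       NUMBER REFERENCES STUDENT,
        CLASS_CODE    NUMBER REFERENCES CLASS,
        ENROLL_GRADE  CHAR(2),
        PRIMARY KEY (STU_NUM, CLASS_CODE)   -- composite of the two FKs
    );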

PK in Existence-Dependent Relationships:

If one entity depends for its existence within the database on one or more other entities, then the existence-

dependent table should include the primary key of all tables upon which its existence depends. As the text

indicates, these existence dependencies can be natural, such as the existence dependence of DEPENDENT

on EMPLOYEE or the existence dependence of GRADED_ITEM on CLASS. Existence dependence can also arise as a result of the normalization process required to correctly represent composite entities in a relational database.

(Section 6.3.4) When to use Surrogate PK

It can often be very difficult or impossible to identify correct primary keys for natural entities,

particularly natural events. In these situations the only solution is to have the computer or user create a

unique primary key for each entity that is inserted into the table that represents such an entity. These keys

are called synthetic primary keys or surrogate keys.

It is famously difficult to identify correct natural keys for people, and for security reasons it is not desirable to have an ID card that uses your SSN as your ID number. This is why SSNs should not be used as PKs.

They are helpful when there is no natural key, when the selected candidate key has embedded semantic contents, or when the selected candidate key is too long or cumbersome. If you use a surrogate key, you must ensure that the candidate key of the entity in question performs properly through the use of "unique index" and "not null" constraints.

Integer surrogate keys are the norm in large and high performance databases. This is because

integers are the most compact representation of an identifier that is unique for a number of unique entities,

and because it is usually most efficient for computers to store and operate on integers.
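A minimal sketch of this pattern in Oracle-style SQL (the VENUE table and its columns are assumed for illustration, and the identity-column syntax assumes Oracle 12c or later; older versions would use a sequence instead): an integer surrogate PK plus UNIQUE and NOT NULL constraints protecting the natural candidate key.

    CREATE TABLE VENUE (
        VENUE_ID    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
        VENUE_NAME  VARCHAR2(60)  NOT NULL,
        VENUE_ADDR  VARCHAR2(100) NOT NULL,
        CONSTRAINT venue_natural_uq UNIQUE (VENUE_NAME, VENUE_ADDR)   -- protect the candidate key
    );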

Tables that should not have PKs:

The text assumes that all tables should have primary keys, but this is not always true. Tables that represent

real-world entities or parts of entities should always have primary keys. Most operational databases in

financial or other sensitive applications include tables whose sole role is to preserve a record of events or

changes to the database. These history or audit tables often record a variety of internal events within the


system which may have no corresponding durable entity in the real world, and no natural key. Clients of

mine have tried in vain to develop a natural primary key for these tables, including as many as a dozen

columns, often with a timestamp to help assure uniqueness. After such an application has run for a while

they have discovered to their dismay that they occasionally have primary key uniqueness violations that

prevent the insertion of a history record.

The problem in these situations is not that they have selected incorrect columns for the natural primary key,

but that such tables usually have no reliable natural primary key, even including a timestamp. The solution

is to not have a primary key. Not having to maintain the unique index for the primary key speeds inserts

into history tables. You may want to index the tables so that you can efficiently retrieve the historic data. If

you do need a primary key, for example if a history table must be referenced from another table, then use a

synthetic (surrogate) primary key.

(Section 6.4) Design cases: Learning Flexible DB Design

Databases are the most long-lived of all software components. Many databases have been continuously

used and updated for decades. Most of the life cycle cost of databases is thus in the maintenance phase of

the database life cycle. Thus the most important characteristic of a database design is that it be easy to

modify as business needs change. There is an old saying in the computer industry that an easily maintained

design that doesn't happen to be completely correct is not a problem, because you can just fix it, but that an

un-maintainable design is a disaster, because even if it works now something will inevitably happen that

will require you to change it, and then you have a big problem. In this section we describe the things that

you can do to make sure that your designs are flexible enough so that they can be easily maintained.

There are two more advanced topics in the design of foreign keys for 1:1 relationships. One consideration is that modern DBMSs, including Oracle, support clustering of tables that have 1:1 mandatory relationships and a common key that identifies the rows that are related 1:1. Clustering tables combines the corresponding logical rows of the clustered tables into one physical row in storage. Because the clustered rows are actually one physical row, there is no need to repeat the shared columns in the cluster key, so the database is smaller; the columns in the common cluster key are stored only once. I often cluster tables that are related by a 1:1 mandatory relationship. Clustering also stores the related rows together in one physical row, so joining the tables is essentially free, and you can effectively ignore the performance consequences of joining both tables in requests.

(Section 6.4.1) Design Case #1: Implementing 1:1 Relationships

FKs work with PKs to properly implement relationships in the relational model. The basic rule is to put the PK of the "one" side (the parent entity) on the "many" side (the dependent entity) as a foreign key. However, where do you place the FK when you are working with a 1:1 relationship? There are two options (see the sketch below):

Place a FK in both entities

Place a FK in one of the entities, which is the preferred solution.
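The following is a hedged sketch of the preferred option, with the FK placed in only one of the two tables (the EMPLOYEE-manages-DEPARTMENT example and its column names are assumptions for illustration); the UNIQUE constraint keeps the relationship 1:1 rather than 1:M.

    CREATE TABLE EMPLOYEE (
        EMP_NUM   NUMBER PRIMARY KEY,
        EMP_NAME  VARCHAR2(50)
    );

    CREATE TABLE DEPARTMENT (
        DEPT_CODE  NUMBER PRIMARY KEY,
        DEPT_NAME  VARCHAR2(50),
        EMP_NUM    NUMBER UNIQUE REFERENCES EMPLOYEE   -- FK on one side only; UNIQUE enforces 1:1
    );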

(Section 6.4.2) Design Case #2: Maintaining History of Time-Variant Data

Time-variant data refer to data whose values change over time and for which you must keep a history of the data changes. Keeping the history of time-variant data is equivalent to having a multivalued attribute in your entity. To model time-variant data, you must create a new entity in a 1:M relationship with the original entity.

Representing History: One nice feature of designs with a current data table and a corresponding history

table is that the same queries can be run against the current status table (e.g., DEPARTMENT) or against

the corresponding history table (e.g., DEPARTMENT_HIST). Queries can be run against the history table

to return results corresponding to the state at any previous time. For example, we can run a report today as

if it were being run at the end of the previous quarter, reflecting the state of the DEPARTMENT table or

any number of additional tables at that time. This is very useful for many kinds of businesses.

Queries run against the history table will have at least one additional WHERE clause, and often a

subquery. The history table is typically much larger than the current data table. For these reasons queries

against the history table are not as fast as queries against the current data table. This is why it is convenient


to have both tables. The current data table supports current operational transactions, while the

corresponding history table supports historic analysis and reporting. Performance is not so important for

these historic functions, so the extra size and overhead of the history table is acceptable. Note that the

current data table redundantly stores the latest data in the history table, so care must be taken to assure that

these are always consistent. I usually encapsulate updates to the pair of tables in a stored procedure, and

write a stored procedure or script that checks that they are consistent. Triggers can also be used to maintain

the denormalized data. Triggers can be used to add history to an existing database and applications without

requiring changes to existing SQL.
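As one hedged illustration of that trigger-based approach (the DEPARTMENT and DEPARTMENT_HIST tables and their columns are assumed names, not the lecture's exact design): every insert or update of the current table also writes a row to the history table, so existing application SQL does not have to change.

    CREATE TABLE DEPARTMENT_HIST (
        DEPT_CODE   NUMBER,
        DEPT_NAME   VARCHAR2(50),
        CHANGED_ON  TIMESTAMP
        -- no primary key: this is a pure history table (see the earlier
        -- discussion of tables that should not have PKs)
    );

    CREATE OR REPLACE TRIGGER trg_department_hist
    AFTER INSERT OR UPDATE ON DEPARTMENT
    FOR EACH ROW
    BEGIN
        -- copy the new current values into the history table
        INSERT INTO DEPARTMENT_HIST (DEPT_CODE, DEPT_NAME, CHANGED_ON)
        VALUES (:NEW.DEPT_CODE, :NEW.DEPT_NAME, SYSTIMESTAMP);
    END;
    /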

Designing DB History and Audit:

We often need to maintain a record of transactions, for some time after the transactions have been

completed, to support a review of the transactions or for internal or external audit. The requirements for this

history data are quite different from those of the operational database, and consequently the designs for

history and audit tables are correspondingly quite different. The differences in the requirements are

summarized in the following table. On Lecture 6 section 3.11

When history is kept in separate tables, those history tables are often quite denormalized, so that each record in the history table represents the entire event or transaction that is being recorded, for later analysis or

audit. The following table summarizes common denormalizations in history tables:

On Lecture 6 section 3.11

Designing History and Audit Tables:

History and audit tables may record more information than is required for operations. For example, let's

look at what happens at a financial services firm, such as a bank, when any change is made to a customer's

address. While the operational tables only store the new address, the audit tables will record who made the

change, when they made it, from where they made it, and references to any paper documents, voice

recordings or other audit data that may be outside the database.

Historic data is also stored in data warehouses and other decision support systems. Modern data

warehouses store fact data at the level of atomic business transactions, so it is sometimes feasible to use a

data warehouse as the longer-term history and audit repository, but this can be problematic. The following

table summarizes differences between audit database requirements and data warehouse requirements, as

well as the problems created when data warehouses are used as history and audit repositories.

(Section 6.4.3) Design Case #3: Fan Traps

Design Trap: Occurs when a relationship is improperly or incompletely identified and is therefore

represented in a way that is not consistent with the real world. The most consistent design trap is known as

a Fran Trap: It occurs when you have one entity in two 1:M relationships to other entities, thus producing

an association among the other entities that is not expressed in the model. Fan traps occur when fewer

relationships are explicitly represented than matter in the real world and the subset of the relationships that

are represented are not sufficient to infer missing important relationships.

(Section 6.4.4) Design Case #4: Redundant Relationships

Redundant relationships occur when there are multiple relationship paths between related entities. The

main concern with redundant relationships is that they remain consistent across the model. It is important to

note that some designs use redundant relationships as a way to simplify the design.

(Section 6.5) Data Modeling Checklist

This is to ensure one fulfills data modeling tasks successfully. The checklist is on page 212 on the text.

Some checklist items for generalization-specialization not mentioned in the text:

Verify that all attributes of the superclass are needed in all subclasses

Verify that all common attributes of subclasses have been correctly migrated to the superclass.

Verify that domain experts agree that the subclasses are really specializations of the superclass.

Verify that the business rules associated with the superclass really do apply to all subclasses.


--------------------------------------------------------------------------------------------------------------

Chapter 8: Advanced SQL

(Section 8.7) Procedural SQL

Persistent stored module (PSM) is a block of code containing standard SQL statements and procedural extensions that is stored and executed at the DBMS server.

Procedural SQL (PL/SQL) is a language that makes it possible to use and store procedural code and SQL

statements within the db and to merge SQL and traditional programming constructs, such as variables,

conditional processing (IF-THEN-ELSE), basic loops (FOR and WHILE) and error trapping.

Anonymous PL/SQL block

PL/SQL starts with a DECLARE section.

CHAR

VARCHAR2

NUMBER

DATE

%TYPE

WHILE Loop

END LOOP

|| (the concatenation operator) to build the output to display.
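A small hedged example of an anonymous PL/SQL block using these constructs (the PRODUCT table, its P_QOH and P_CODE columns, and the product code value are assumed for illustration):

    DECLARE
        v_qoh    PRODUCT.P_QOH%TYPE;   -- %TYPE inherits the column's data type
        v_count  NUMBER := 1;
    BEGIN
        SELECT P_QOH INTO v_qoh FROM PRODUCT WHERE P_CODE = 'ABC123';
        WHILE v_count <= 3 LOOP
            DBMS_OUTPUT.PUT_LINE('Pass ' || v_count || ': quantity on hand = ' || v_qoh);
            v_count := v_count + 1;
        END LOOP;
    END;
    /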

(Section 8.7.1) Triggers

Trigger is procedural SQL code that is automatically invoked by the RDBMS upon the occurrence of a

given data manipulation event.

A trigger is invoked before or after a data row is inserted, updated, or deleted

A trigger is associated with a db table

Each db table may have one or more triggers

A trigger is executed as part of the transaction that triggered it

Triggers are critical to proper db operations and management

Triggers can be used to enforce constraints that cannot be enforced at the DBMS design and

implementation levels.

Triggers add functionality by automating critical actions and providing appropriate warnings and

suggestions for remedial action. In fact, one of the most common uses for triggers is to facilitate

the enforcement of referential integrity.

Triggers can be used to update table values, insert records in tables, and call other stored

procedures.

Triggers play a critical role in making the db truly useful; they also add processing power to the RDBMS

and to the db system as a whole. Oracle recommends triggers for:

Auditing purposes creating audit logs.

Automatic generation of derived column values

Enforcement of business or security constraints

Creation of replica tables for backup purposes.

Statement Level Triggers: Is assumed if you omit the FOR EACH ROW keywords. This trigger is

executed once, before or after the triggering statement is complete. This is the default case.

Row Level Trigger: Requires use of the FOR EACH ROW keywords. This type of trigger is executed

once for each row affected by the triggering statement. If you update 10 rows the trigger executes 10 times.
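The following is a hedged sketch of a row-level trigger (the PRODUCT table and its P_QOH, P_MIN, and P_REORDER columns are assumed names): it sets a reorder flag whenever an insert or update leaves the quantity on hand at or below the minimum.

    CREATE OR REPLACE TRIGGER trg_product_reorder
    BEFORE INSERT OR UPDATE OF P_QOH ON PRODUCT
    FOR EACH ROW                          -- row-level: fires once per affected row
    BEGIN
        IF :NEW.P_QOH <= :NEW.P_MIN THEN
            :NEW.P_REORDER := 1;          -- flag the product for reordering
        ELSE
            :NEW.P_REORDER := 0;
        END IF;
    END;
    /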

--------------------------------------------------------------------------------------------------------------


Chapter 9: Database Design

(Section 9.1) IS

Info System: provides for data collection, storage, and retrieval. It also facilitates the transformation of data into info, and it allows for the management of both data and info.

System Analysis: is the process that establishes the need for and the extent of an info system.

System Development: is the process of creating an IS, within which apps transform data into info for decision making.

Every app is composed of two parts: Data and the code.

Performance of an IS depends on a triad of factors.

DB design and Implementation

App design and implementation

Admin procedures

Db Development: The process of db design and implementation.

Db Design: Primary objective is to create complete, normalized, non-redundant (to the extent possible),

and fully integrated conceptual, logical, and physical db models.

(Section 9.2) System Development Life Cycle (SDLC)

SDLC: Traces the history (life cycle) of an IS and provides the big picture. It is divided into 5 phases: Planning, Analysis, Detailed Systems Design, Implementation, and Maintenance. The SDLC is an iterative process.

(Section 9.2.1) Planning

Planning: Yields a general overview of the company and its objectives.

Should the existing system be continued?

Should the existing system be modified?

Should the existing system be replaced?

If the new system is necessary, the next question is whether it is feasible. The feasibility study must address

the following:

The technical aspects of hardware and software requirements.

The system cost.

The operational cost.

(Section 9.2.2) Analysis

The problems defined during the planning phase are examined in greater detail during the analysis phase.

What are the requirements of the current system‘s end users?

Do those requirements fit into the overall info requirements?

The analysis phase includes a thorough audit of user requirements.

The existing hardware/software systems are also studied during the analysis phase.

DB data modeling activities take place (e.g., DFD and HIPO diagrams).

(Section 9.2.3) Detailed Systems Design

The designer completes the design of the system's processes. This includes screens, menus, reports, and other devices that might be used to help make the system a more efficient info generator.

(Section 9.2.4) Implementation

During this phase, the hardware, DBMS software, and app programs are installed, and the db design is

implemented. The system enters into a cycle of coding, testing, and debugging until it is ready to be

delivered.

(Section 9.2.5) Maintenance


Corrective maintenance in response to systems errors.

Adaptive maintenance due to changes in the business environment.

Perfective maintenance to enhance the system.

Computer-aided systems engineering (CASE): technology such as system Architect or Visio helps make

it possible to produce better systems within a reasonable amount of time and at a reasonable cost.

(Section 9.3) The DB life cycle (DBLC)

DBLC Contains 6 phases: Db initial study, db design, implementation & loading, testing & evaluation,

operation, & maintenance and evolution.

(Section 9.3.1) The DB Initial Study

Analyze the company situation, define problems and constraints, and define objectives, scope, and boundaries. The purpose of the DB initial study is to:

Analyze the company situation:

This describes the general conditions in which a company operates, its organizational structure, and its mission. These issues must be resolved:

What is the org's general operating environment, and what is its mission within that environment?

What is the org's structure?

Define Problems & Constraints:

How does the existing system function?

What input does the system require?

What docs does the system generate?

By whom and how is the system output used?

Define Objectives:

What is the proposed system‘s initial objective?

Will the system interface with other existing or future systems in the company?

Will the system share the data with other systems or users?

Define Scope and Boundaries:

Scope defines the extent of the design according to operational requirements.

Will the db design encompass the entire org, one or more departments within the org, or one or

more functions of a single department?

Boundaries: limits of the proposed system. They are external to the system.

The scope and boundaries become the factors that force the design into a specific mold, and the designer's job is to design the best system possible within those constraints.

(Section 9.3.2) DB design

In the process of db design, you must concentrate on the data characteristics required to build the database model. At this point, there are two views of the data within the system: the business view of data as a source of info, and the designer's view of the data structure, its access, and the activities required to transform the data into info. Defining data is an integral part of the DBLC's second phase.

1: Conceptual Design: data modeling is used to create an abstract db structure that represents real world

objects in the most realistic way.

Four steps:

Data analysis and requirements

Entity relationship modeling and normalization

Data model verification

Distributed db design


Minimal data rule: all that is needed is there, all that is there is needed. Make sure that all data needed are

in the model and that all data in the model are needed.

Data analysis and requirements:

The first step in conceptual design is to discover the characteristics of the data elements. Designer is

focused on:

Information needs

Information users

Information sources

Information constitution such as what data elements are needed to produce the info?

The designer obtains the answers by the following:

Developing and gathering end-user data views

Directly observing the current system

Interfacing with the systems design group

From a db point of view, the collection of data becomes meaningful only when business rules are defined.

Description of operations is a doc that provides a precise, up-to-date, and thoroughly reviewed description of the activities that define an org's operating environment. To the db designer, the operating environment supplies both the data sources and the data users.

Entity Relationship Modeling and Normalization:

The ER model is a communication tool as well as a design blueprint.

During the ER modeling process, the designer must:

Define entities, attributes, pk, fk.

Make decisions about adding new pk attributes to satisfy end-user and or processing requirements.

Make decisions about the treatment of multi-valued attributes.

Make decisions about adding derived attributes to satisfy processing requirements.

Make decisions about the placement of fk in 1:1 relationships. Avoid unnecessary ternary

relationships.

Draw the corresponding ER diagram.

Include all data element definitions in the data dictionary.

Make decisions about standard naming conventions.

Data Model Verification:

The ER model must be verified against the proposed system processes in order to corroborate that the

intended processes can be supported by the db model.

ER model verification process:

1) Identify the ER model's central entity.

2) Identify each module and its components.

3) Identify each module's transaction requirements. Internal: updates/inserts/deletes/queries/reports.

External: module interfaces.

4) Verify all processes against the ER model.

5) Make all necessary changes suggested in step 4.

6) Repeat steps 2-5 for all modules.

Module: IS component that handles a specific function, such as inventory, orders, payroll, and so on. At

the design level, a module is an ER segment that is an integrated part of the overall ER model. They speed

up development work, simplify the design work, and can be prototyped quickly. Think of this as a WBS.

The disadvantage is that it does create fragmentation, which creates a potential problem: the fragments might not include all of the ER model components and might not, therefore, be able to support all of the required processes. To avoid this issue, the modules must be verified against the complete ER model.

Within the central entity/module framework you must:


A module must display high cohesivity, which describes the strength of the relationships found among the module's entities.

Module coupling describes the extent to which modules are independent of one another. Modules

must display low coupling, indicating that they are independent of other modules.

Processes may be classified according to their:

Frequency: Daily, weekly, monthly, yearly, or exceptions.

Operational type: Insert or Add, Update or Change, Delete, queries and reports, batches, maintenance, and

backups.

2 DBMS Software Selection:

Some of the common factors affecting the purchasing decision are:

Cost:

DBMS features and tools:

Underlying model:

Portability:

DBMS hardware requirements:

3 Logical Design:

It translates the conceptual design into the internal model for a selected db management system. Therefore

the logical design is software dependent.

Physical Design:

Is the process of selecting the data storage and data access characteristics of the database. It becomes more complex when data are distributed at different locations, because performance is affected by the communication media's throughput.

(Section 9.3.3) Implementation and Loading

Create db storage group. Sysadmin

Create db within the storage group. Sysadmin

Assign the rights to use the db to a db admin. DBA

Create the table space within the db. DBA

Create the table within the table space. DBA

Assign access rights to the table spaces and to the tables within specified table spaces. DBA
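A hedged, Oracle-flavored sketch of a few of these steps (all object names, file names, sizes, and the sales_app_role role are illustrative assumptions; exact syntax and required privileges vary by DBMS and version):

    -- DBA: create a tablespace and a schema/user for the application
    CREATE TABLESPACE sales_data
        DATAFILE 'sales_data01.dbf' SIZE 100M;

    CREATE USER sales_admin IDENTIFIED BY ChangeMe1
        DEFAULT TABLESPACE sales_data
        QUOTA UNLIMITED ON sales_data;

    GRANT CREATE SESSION, CREATE TABLE TO sales_admin;

    -- DBA: create a table within the tablespace and assign access rights
    CREATE TABLE sales_admin.CUSTOMER (
        CUS_CODE  NUMBER PRIMARY KEY,
        CUS_NAME  VARCHAR2(50)
    ) TABLESPACE sales_data;

    GRANT SELECT, INSERT, UPDATE ON sales_admin.CUSTOMER TO sales_app_role;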

You also must address performance, security, backup and recovery, integrity, and company

standards.

Performance: DB size will affect performance. Important factors in db performance also include system

and db config parameters, such as data placement, access path definition, the use of indexes, and buffer

size.

Security: Data stored in the company db must be protected from access by unauthorized users.

Physical Security allows only authorized personnel physical access to specific areas.

Password Security allows the assignment of access rights to specific authorized users.

Access rights can be established through the use of db software.

Audit Trails Usually provided by the DBMS to check for access violations.

Data Encryption can be used to render data useless to unauthorized users who might have

violated some of the db security layers.

Diskless Workstations allows end users to access the db without being able to download the info

from their workstations.

Backup & Recovery:

Full Backup or dump of the entire db.

Differential Backup, in which only the last modifications to the db are copied. Only the objects

that have been updated since the last full backup are backed up.


Transaction log backup, which backs up only the transaction log operations that are not reflected in a previous backup copy of the db. The backup copy is then rolled forward to restore all subsequent transactions by using the transaction log info. If the db needs to be recovered but the committed portion of the db is still usable, the recovery process uses the transaction log to undo all of the transactions that were not committed.

Integrity: Is enforced by the DBMS through the proper use of PK and FK rules.

Company Standards: DB standards may be partially defined by specific company requirements. The db

admin must implement and enforce such standards.

(Section 9.3.4) Testing and Evaluation:

Once the data has been loaded into the db, the DBA tests and fine-tunes the db for performance, integrity,

concurrent access, and security constraints. The testing and evaluation phase occurs in parallel with app

programming.

(Section 9.3.5) Operation

Once the db has passed the evaluation stage, it is considered to be operational. The beginning of the

operational phase invariably starts the process of system evolution. As soon as all the end users have entered the operations phase, problems that could not have been foreseen during the testing phase begin to surface.

(Section 9.3.6) Maintenance & Evolution

Preventive maintenance (Backup)

Corrective maintenance (Recovery)

Adaptive maintenance (enhancing performance, adding entities and attributes, and so on)

Assignment of access permissions and their maintenance for new and old users.

Generation of db access statistics to improve the efficiency and usefulness for system audits and to

monitor system performance.

Periodic security audits based on the system generated statistics.

Periodic (monthly, quarterly, or yearly) system-usage summaries for internal billing or budgeting

purposes.

(Section 9.4) Database Design Strategies

There are two classical approaches to db design

Top-down design starts by identifying the data sets and then defines the data elements for each of those sets. This process involves the identification of different entity types and the definition of each entity's attributes.

Bottom-up design first identifies the data elements (items) and then groups them together in data sets. It first defines attributes and then groups them to form entities.

(Section 9.5) Centralized vs. Decentralized design

Centralized Design: is productive when the data component is composed of a relatively small number of objects and procedures. The design can be carried out and represented in a fairly simple db.

Decentralized Design: It might be used when the data component of the system has a considerable number

of entities and complex relations on which very complex operations are performed. The task is divided into

several modules. Takes more than one person.

--------------------------------------------------------------------------------------------------------------


Chapter 10: Transaction Management and Concurrency Control

(Section 10.1) What is a Transaction?

Sales transactions consist of at least the following parts:

You must write a new customer invoice

You must reduce the quantity on hand in the product‘s inventory

You must update the account transactions

You must update the customer balance.

Transaction: is any action that reads from and/or writes to a database. It can consist of a simple SELECT statement to generate a list of table contents; it may consist of a series of UPDATE statements to change the values of attributes in various tables; it may consist of a series of INSERT statements to add rows to one or more tables; or it can be a combination of all of those statements. It must be entirely completed or aborted: if any of the SQL statements fail, the entire transaction is rolled back to the original db state.

Consistent database state: is one in which all data integrity constraints are satisfied. To ensure

consistency of the db, every transaction must begin with the db in a known consistent state. If the db is not

in a consistent state, the transaction will yield an inconsistent db that violates its integrity and business

rules.

Most real-world db transactions are formed by two or more db requests. A Database Request is the equivalent of a single SQL statement in an app program or transaction. If a transaction is composed of two UPDATE statements and one INSERT statement, the transaction uses three db requests.

(Section 10.1.1) Evaluating Transaction Results

If a db existed in a consistent state before the access, the db remains in a consistent state after the access, because the transaction did not alter the db. If a transaction is interrupted, for example by loss of electrical power, the DBMS will roll the db back to its previous consistent state. Improper or incomplete transactions can have a devastating effect on db integrity.

(Section 10.1.2) Transaction Properties

Each individual transaction must display ACID.

Atomicity requires that all operations or components of a transaction be fully completed or aborted.

Consistency means that the completion of a transaction leaves the database in a consistent state.

Isolation means that the data used during the execution of a transaction cannot be used by a second transaction until the first one is completed.

Durability ensures that once transaction changes are completed (committed), they cannot be undone or lost, even during a system malfunction or crash.

Serializability: Ensures that the schedule for the concurrent execution of the transactions yields

consistent results. This is important in multi-user and distributed dbs, where multiple transactions

are likely to be executed concurrently.

A single user db automatically ensures serializability and isolation of the db because only one transaction is

executed at a time.

(Section 10.1.3) Transaction Management with SQL

The American National Standards Institute (ANSI) has defined standards that govern SQL db transactions. Transaction support is provided by two SQL statements: COMMIT and ROLLBACK.
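A hedged sketch of such a transaction (the table and column names and values are assumed, loosely following the sales example at the start of this chapter): either all of the statements are made permanent with COMMIT, or the whole group is undone with ROLLBACK.

    UPDATE PRODUCT
       SET P_QOH = P_QOH - 2
     WHERE P_CODE = 'ABC123';

    INSERT INTO INVOICE (INV_NUM, CUS_CODE, INV_DATE)
    VALUES (1009, 10011, SYSDATE);

    UPDATE CUSTOMER
       SET CUS_BALANCE = CUS_BALANCE + 225.40
     WHERE CUS_CODE = 10011;

    COMMIT;      -- make all of the above changes permanent
    -- ROLLBACK; -- would instead undo every uncommitted change above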

(Section 10.1.4) The Transaction Log

Transaction Log keeps track of all transactions that update the db. The info stored in the log is used by the DBMS for a recovery requirement triggered by a ROLLBACK statement, a program's abnormal termination, or a system failure such as a network discrepancy or a disk crash. The log contains the following:

Record for the beginning of the transaction.


For each transaction component (SQL statement):

-The type of operation being performed (update, delete, insert)

- The names of the objects affected by the transaction (the name of the table)

- The "before" and "after" values for the fields being updated.

- Pointers to the previous and next transaction log entries for the same transaction.

The ending (COMMIT) of the transaction.

If a system failure occurs, the DBMS will examine the transaction log for all uncommitted or incomplete

transactions and restore (ROLLBACK) the db to its previous state on the basis of that info. Committed

transactions are not rolled back. T-Logs are subject to common dangers such as disk-full condition and disk

crashes.

(Section 10.2) Concurrency Control

Concurrency Control is the coordination of the simultaneous execution of transactions in a multi-user db system. Its objective is to ensure the serializability of transactions in a multi-user db environment.

(Section 10.2.1) Lost Updates

Lost Update problem occurs when two concurrent transactions, T1 & T2, are updating the same data

element and one of the updates is lost or overwritten by the other transaction.

(Section 10.2.2) Uncommitted Data

Uncommitted data occurs when two transactions, T1 & T2, are executed concurrently and the first

transaction T1 is rolled back after the second transaction T2 has already accessed the uncommitted data,

thus violating the isolation property of transactions.

(Section 10.2.3) Inconsistent Retrievals

Inconsistent Retrievals occur when a transaction accesses data before and after another transaction (or transactions) finishes working with such data. For example, an inconsistent retrieval would occur if transaction T1

calculates some summary (aggregate) function over a set of data while another transaction T2 is updating

the same data.

(Section 10.2.4) The Scheduler

What would happen if two transactions execute concurrently and they are accessing the related data or the

same data? How is the correct order determined and who determines that order? The DBMS handles that by

using a built-in scheduler.

Scheduler is a special DBMS process that establishes the order in which the operations of concurrent transactions are executed. To determine the appropriate order, the scheduler bases its actions on concurrency control algorithms, such as locking or time stamping methods. Not all transactions are serializable; the DBMS determines that. The scheduler's main job is to create a serializable schedule of the transactions' operations. The scheduler facilitates data isolation to ensure that two transactions do not update the same data element at the same time.

Serializable Schedule is a schedule of a transaction's operations in which the interleaved execution of the transactions yields the same results as if the transactions were executed in serial order, one after another.

(Section 10.3) Concurrency Control with Locking Methods

Lock guarantees exclusive use of a data item to a current transaction. It means that T2 does not have access

to a data item that is currently being used by transaction T1. A transaction acquires a lock prior to data

access; the lock is released (unlocked) when the transaction is completed so that another transaction can

lock the data item for its exclusive use.

Lock Manager is responsible for assigning and policing the locks used by the transactions.

(Section 10.3.1) Lock Granularity

Lock Granularity indicates the level of lock use. Locking can take place at the following levels: db, table,

page, row, or even field (attribute).


Database Level

In a Database-Level Lock, the entire db is locked, thus preventing the use of any tables in the db by transaction T2 while transaction T1 is being executed. It is good for batch processes, but not suitable for multi-user DBMSs.

Table Level

In a Table-Level Lock, the entire table is locked, preventing access to any row by transaction T2 while transaction T1 is using the table. Two transactions can access the same db as long as they do not access the same table. Table-level locks are less restrictive than database-level locks, but they do cause traffic jams when transactions are waiting to access the same table. Not suitable for multi-user DBMSs.

Page Level

In a Page-Level Lock, the DBMS locks an entire diskpage. A diskpage, or page, is the equivalent of a diskblock, which can be described as a directly addressable section of a disk; a page has a fixed size. Page-level locking is currently the most frequently used multi-user DBMS locking method. T1 and T2 can access the same table while locking different diskpages. If T2 requires the use of a row located on a page that is locked by T1, T2 must wait until the page is unlocked by T1.

Row Level

A Row-Level Lock is much less restrictive than the locks discussed earlier. The DBMS allows concurrent

transactions to access different rows of the same table even when the rows are located on the same page. Its

management requires high overhead because a lock exists for each row in a table of the db involved in a

conflicting transaction. T2 must wait only if it requests the same row as T1.

Field Level

The Field-Level Lock allows concurrent transactions to access the same row as long as they require the use

of different fields within that row. It is rarely implemented in a DBMS because it requires an extremely

high level of computer overhead and because the row-level lock is much more useful in practice.

(Section 10.3.2) Lock Types

Binary Locks

A Binary Lock has only two states: Locked (1) or Unlocked (0). Binary locks are considered too restrictive to yield optimal concurrency conditions. For example, the DBMS will not allow two transactions to read the same db object, even though neither transaction updates the db and, therefore, no concurrency problems can occur.

Shared/Exclusive Locks

The labels "Shared" and "Exclusive" indicate the nature of the lock.

Exclusive Lock exists when access is reserved specifically for the transaction that locked the object. It

must be used when the potential for conflict exists. An Exclusive Lock is issued when a transaction wants

to update (write) a data item and no locks are currently held on that data item by any other transaction.

Shared Lock exists when concurrent transactions are granted read access on the basis of a common lock. A shared lock produces no conflict as long as all the concurrent transactions are read only. A shared lock is issued when a transaction wants to read data from the db and no exclusive lock is held on that data item.

Using the Shared/Exclusive locking concept, a lock can have 3 states: Unlocked, Shared (read), and

Exclusive (write). Two transactions conflict only when at least one of them is a Write transaction. Shared

locks allow several Read transactions to read the same data item concurrently.

Internet Shared lock vs. Exclusive lock

An exclusive lock allows only one user/connection to access (read or update) a particular piece of data. A

shared lock allows multiple users to read data, but doesn't allow any of them to update the data.

If a user is updating data (as in our bank example) and is using pessimistic concurrency control (i.e.

locking), then that user must acquire an "exclusive" lock. No other user may read or update that data (e.g.

bank account record) while the exclusive lock is held. In addition, if you are using pessimistic concurrency

control, no other user may even view the record that has been exclusively locked. That prevents a user from

seeing, for example, a mix of updated data and not-yet-updated data. At any given time, only one user may

have an exclusive lock on a particular piece of data.

If both users only want to read (not change) the data, then each user can use a "shared" lock. For example,

if I am reading, but not updating, a record, then another user can look at that record at the same time. Many

users may have shared locks on the same item (record, table, etc.) at the same time. For example, you, your

spouse, your banker, and a credit rating agency could all look at your checking account balance

simultaneously, as long as none of you try to change it at the same time.

Shared and exclusive locks cannot be mixed. If you have an exclusive lock on a record, I cannot get a

shared lock (or an exclusive lock) on that same record.

Mutual Exclusive Rule: Only one transaction at a time can own an exclusive lock on the same object.

Shared/Exclusive lock schemas increase the lock manager's overhead.
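As a rough illustration only, many SQL DBMSs expose these lock modes through statements such as the Oracle-style LOCK TABLE and SELECT ... FOR UPDATE commands; the ACCOUNT table and column names below are hypothetical, and exact syntax and default locking behavior vary by product.

-- Session A: table-level shared (read) lock; other readers are still allowed
LOCK TABLE account IN SHARE MODE;
SELECT acct_balance FROM account WHERE acct_num = 1001;

-- Session B: row-level exclusive (write) lock; waits until no conflicting lock remains
SELECT acct_balance FROM account WHERE acct_num = 1001 FOR UPDATE;
UPDATE account SET acct_balance = acct_balance - 100 WHERE acct_num = 1001;
COMMIT;  -- releases the locks held by this transaction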

Although locks prevent serious data inconsistencies, they can lead to two major problems:
The resulting transaction schedule might not be serializable.
The schedule might create deadlocks. A db deadlock, which is equivalent to traffic gridlock in a big city, is caused when two or more transactions wait for each other to unlock data.

(Section 10.3.3) Two-Phase Locking to Ensure Serializability

Two-Phase Locking: defines how transactions acquire and relinquish locks. Two-phase locking guarantees

serializability, but it does not prevent deadlocks. The two phases are:

A growing phase, in which a transaction acquires all required locks without unlocking any data.

Once all locks have been acquired, the transaction is in its locked point.

A shrinking phase, in which a transaction releases all locks and cannot obtain any new lock.

Two-Phase locking protocol is governed by the following rules:

Two transactions cannot have conflicting locks

No unlock operation can precede a lock operation in the same transaction

No data are affected until all locks are obtained, that is, until the transaction is in its locked point.
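Many relational DBMSs apply a strict form of two-phase locking transparently: locks are acquired as statements execute (growing phase) and are all released together at COMMIT or ROLLBACK (shrinking phase). A minimal sketch, assuming a hypothetical ACCOUNT table and Oracle-style syntax:

-- Growing phase: each statement acquires the locks it needs and keeps them
SELECT acct_balance FROM account WHERE acct_num = 1001 FOR UPDATE;  -- lock row 1001
SELECT acct_balance FROM account WHERE acct_num = 2002 FOR UPDATE;  -- lock row 2002 (locked point reached)

UPDATE account SET acct_balance = acct_balance - 100 WHERE acct_num = 1001;
UPDATE account SET acct_balance = acct_balance + 100 WHERE acct_num = 2002;

-- Shrinking phase: COMMIT releases every lock at once; no new locks can follow
COMMIT;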

(Section 10.3.4) Deadlocks
A deadlock occurs when two transactions wait indefinitely for each other to unlock data. It is also known as the Deadly Embrace. Deadlocks are possible only when one of the transactions wants to obtain an exclusive lock on a data item; no deadlock condition can exist among shared locks. Example (a SQL sketch of this situation appears after the list below):
T1 = access data items X and Y
T2 = access data items Y and X

The 3 basic techniques to control deadlocks are:
Deadlock Prevention: a transaction requesting a new lock is aborted when there is the possibility that a deadlock can occur.
Deadlock Detection: the DBMS periodically tests the db for deadlocks; if one is found, one of the transactions (the "victim") is aborted and rescheduled while the other continues.
Deadlock Avoidance: a transaction must obtain all of the locks it needs before it can be executed.
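A minimal two-session sketch of the deadly embrace above, assuming a hypothetical ACCOUNT table in which acct_num 1001 plays the role of X and 2002 the role of Y:

-- Time 1, session T1: locks row X
UPDATE account SET acct_balance = acct_balance - 10 WHERE acct_num = 1001;
-- Time 2, session T2: locks row Y
UPDATE account SET acct_balance = acct_balance - 10 WHERE acct_num = 2002;
-- Time 3, session T1: requests row Y and must wait for T2
UPDATE account SET acct_balance = acct_balance + 10 WHERE acct_num = 2002;
-- Time 4, session T2: requests row X and must wait for T1 -> deadlock
UPDATE account SET acct_balance = acct_balance + 10 WHERE acct_num = 1001;
-- Deadlock detection in the DBMS aborts one session (the victim) so the other can proceed.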

(Section 10.4) Concurrency Control with Time Stamping Methods

Time Stamping approach to scheduling concurrent transactions assigns a global, unique time stamp to

each transaction. The time stamp value produces an explicit order in which transactions are submitted to the

DBMS. They have two properties: Uniqueness and monotonicity. All db operations (READ & WRITE)

within the same transaction must have the same time stamp. If two transactions conflict, one is stopped,

rolled back, rescheduled, and assigned a new time stamp value.

Disadvantage: each value stored in the db requires two additional time stamp fields: one for the last time the field was read and one for the last update. Time stamping also increases memory needs and db processing overhead, and it demands a lot of system resources because many transactions might have to be stopped, rescheduled, and restamped.

Monotonicity: Ensures that time stamp values always increase.

(Section 10.4.1) Wait/Die and Wound/Wait Schemes

Two schemes determine which transaction is rolled back and which continues:
Wait/Die Scheme: the older transaction waits for the younger one to complete and release its locks.
Wound/Wait Scheme: the older transaction rolls back (wounds) the younger transaction and reschedules it.

(Section 10.5) Concurrency Control with Optimistic Methods

The Optimistic Approach is based on the assumption that the majority of db operations do not conflict. It requires neither locking nor time stamping techniques. Instead, a transaction is executed without restrictions until it is committed. In this approach, each transaction moves through two or three phases, referred to as read, validation, and write.

Read stage: The transaction reads the db, executes the needed computations, and makes the updates to a private copy of the db values. All update operations of the transaction are recorded in a temporary update file, which is not accessed by the remaining transactions.

Validation Stage: The transaction is validated to ensure that the changes made will not affect the

integrity and consistency of the database. If the validation test is positive, the transaction goes to

the write phase. If the validation test is negative, the transaction is restarted and the changes are

discarded.

Write Stage: The changes are permanently applied to the db.
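The optimistic approach is implemented inside the DBMS, but applications often mimic its validation step with a version (or time stamp) column: the update succeeds only if the row is unchanged since it was read. A hedged sketch, assuming a hypothetical PRODUCT table with a ROW_VERSION column:

-- Read phase: remember the version value seen when the row was read
SELECT p_qoh, row_version FROM product WHERE p_code = '2232/QTY';

-- Validation + write phase: the WHERE clause matches zero rows
-- if another transaction changed the row in the meantime
UPDATE product
   SET p_qoh = p_qoh - 1,
       row_version = row_version + 1
 WHERE p_code = '2232/QTY'
   AND row_version = 7;   -- the version value read earlier
-- If zero rows were updated, the application restarts the transaction,
-- mirroring the "restart and discard" outcome of a failed validation.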

(Section 10.6) Database Recovery Management

Database Recovery restores a db from a given state (usually inconsistent) to a previously consistent state.

Atomic Transaction Property: All portions of the transaction must be treated as a single, logical unit of

work in which all operations are applied and completed to produce a consistent db.

Transaction recovery reverses all of the changes that the transaction made to the db before the transaction

was aborted.

Critical errors can cause a db to become non-operational and compromise the integrity of the data.

Examples:

Hardware/Software failures

Human caused incidents

Natural disasters

(Section 10.6.1) Transaction Recovery

Write-Ahead-Log Protocol: Ensures that transaction logs are always written before any db data are

actually updated. This protocol ensures that, in case of a failure, the db can later be recovered to a

consistent state, using the data in the transaction log.

Redundant Transaction Logs: (several copies of the transaction log) ensure that a physical disk failure will not impair the DBMS's ability to recover data.

Db Buffers: Temp storage areas in primary memory used to speed up disk operations.

Checkpoints: Operation in which the DBMS writes all of its updated buffers to disk. While this is

happening, the DBMS does not execute any other requests. Checkpoint is registered in the transaction log.

The physical db and the transaction log will be in sync. They are automatically scheduled by the DBMS

several times per hour.

Deferred-Write Technique or Deferred Update: transaction operations do not immediately update the physical db; instead, only the transaction log is updated. The db is physically updated only after the transaction reaches its commit point, using info from the transaction log. If the transaction aborts before it reaches its commit point, no changes (no ROLLBACK or undo) need to be made to the db because the db was never updated.

Write-Through Technique or Immediate Update: the db is immediately updated by the transaction operations during the transaction's execution, even before the transaction reaches its commit point. If the transaction aborts before it reaches its commit point, a ROLLBACK or undo operation needs to be done to restore the db to a consistent state. In this case, the ROLLBACK operation will use the transaction log "before" values.

---------------------------------------------------------------------------------------------------------------- --------------

Chapter 11 DB Performance Tuning & Query Optimization

(Section 11.1) DB Performance Tuning Concepts

Database Performance Tuning: refers to a set of activities and procedures designed to reduce the response time of the db system, that is, to ensure that an end-user query is processed by the DBMS in the minimum amount of time.
The performance of a typical DBMS is constrained by three main factors: CPU processing power, available primary memory (RAM), and input/output (hard disk and network) throughput.

Good db performance starts with good db design.

(Section 11.1.1) Performance Tuning: Client and Server

Performance tuning can be divided into client and server side.

On the client side, the objective is to generate a SQL query that returns the correct answer in the least amount of time, using the minimum amount of resources at the server end. This is referred to as SQL performance tuning.
On the server side, the DBMS environment must be properly configured to respond to client requests in the fastest way possible, while making optimum use of existing resources. This is referred to as DBMS performance tuning, and it is more complex than tuning a simple two-tier client/server configuration.

(Section 11.1.2) DBMS Architecture
Data files: All data in a db are stored in data files.
Extends: Data files can automatically expand in predefined increments known as extends. The DBA can define, for example, that each new extend will be a 10 KB or 10 MB increment as the data grow.
Data files are generally grouped in file groups or table spaces.
Table Space or File Group: a logical grouping of several data files that store data with similar characteristics. Each time you create a new database, the DBMS automatically creates a minimum set of table spaces.

Data cache or buffer cache is a shared, reserved memory area that stores the most recently accessed data

blocks in RAM. The data cache is where the data read from the db data files are stored after the data have

been read or before the data are written to the db data files. The data cache also caches system catalog data

and the contents of the indexes.

SQL cache or Procedure cache is a shared, reserved memory area that stores the most recently executed

SQL statements or PL/SQL procedures, including triggers and functions.

To work with the data, the DBMS must retrieve the data from permanent storage (the data files in which the data are stored) and place it in RAM (the data cache).

In order to move data from the permanent storage (data files) to the RAM (data cache) the DBMS issues

I/O requests and waits for the replies. An Input/Output (I/O) request is a low-level (read or write) data

access operation to and from computer devices, such as memory, hard disks, video, and printers.

Working with data in the data cache is many times faster than working with data in the data files because the DBMS doesn't have to wait for the hard disk to retrieve the data; no hard disk I/O operations are needed to work within the data cache.

The majority of performance-tuning activities focus on minimizing the number of I/O ops because using

I/O ops is many times slower than reading data from the data cache.

Typical DBMS processes:

Listener: Listens for client requests and routes the SQL requests to the other DBMS processes. Once a request is received, the listener passes it to the appropriate user process.

User: The DBMS creates a user process to manage each client session.

Scheduler: It organizes the concurrent execution of SQL requests. Transaction management and

concurrency control.

Lock Manager: This process manages all locks placed on db objects, including disk pages.

Optimizer: This process analyzes SQL queries and finds the most efficient way to access the data.

(Section 11.1.3) Database Statistics

Database Statistics: Another DBMS activity that plays an important role in query optimization is gathering db statistics. The term refers to a number of measurements about db objects (such as tables and indexes) and their environment (such as number of processors used, processor speed, and temporary space available). DB statistics can be manually gathered by

the DBA or automatically by the DBMS. Oracle, SQL server, and DB2 automatically gather statistics;

others require the DBA to gather them manually. DB statistics are stored in the system catalog.
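As an illustration, in Oracle the DBA can refresh statistics manually with the DBMS_STATS package or the older ANALYZE command; the schema and table names below are hypothetical, and other products use different utilities (for example, DB2's RUNSTATS).

-- Oracle-style manual statistics gathering for one table
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SALECO', tabname => 'CUSTOMER');
END;
/

-- Legacy alternative
ANALYZE TABLE customer COMPUTE STATISTICS;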

(Section 11.2) Query Processes

A DBMS processes a query in 3 steps

Parsing: The DBMS parses the SQL query and chooses the most efficient access/execution plan.

Execution: The DBMS executes the SQL query using the chosen execution plan.

Fetching: The DBMS fetches the data and sends the result set back to the client.

(Section 11.2.1) SQL Parsing Phase

The optimization process includes breaking down (parsing) the query into smaller units and transforming the original SQL query into a slightly different version of the original SQL code, one that is fully equivalent and more efficient.

Query Optimizer: The SQL parsing activities are performed by the query optimizer, which analyzes the SQL query and finds the most efficient way to access the data. This is the most time-consuming phase in query processing. Parsing a SQL query requires the following steps; the query is:

Validated for syntax compliance

Validated against the data dictionary to ensure that tables and column names are correct.

Validated against the data dictionary to ensure that the user has proper access rights.

Analyzed and decomposed into more atomic components.

Optimized through transformation into a fully equivalent but more efficient SQL query.

Prepared for execution by determining the most efficient execution or access plan.

Once the SQL statement is transformed, the DBMS creates what is commonly known as an access or

execution plan.

Access Plan: the result of parsing a SQL statement; it contains the series of steps a DBMS will use to execute the query and to return the result set in the most efficient way. Access plans are DBMS-specific and translate the client's SQL query into the series of complex I/O operations required to read the data from the physical data files and generate the result set.

First the DBMS checks to see if an access plan already exists for the query in the SQL cache.

If it does, the DBMS reuses the access plan to save time.

If it doesn't, the optimizer evaluates various plans and makes decisions about which indexes to use and how to best perform join ops.

The chosen access plan for the query is then placed in the SQL cache and made available for use

and future reuse.

Sample DBMS Access Plan I/O ops

o Table Scan (FULL) Reads entire table sequentially and is the Slowest.

o Table Access (Row ID) Fastest method. Reads a table row directly using the row id

value.

o Index Scan (Range) Reads the index first to obtain the row id and then accesses the table

rows directly (faster than a full table scan).

o Index Access (Unique) Used when a table has a unique index in a column.

o Nested Loop reads and compares a set of values to another set of values, using a nested

loop style (slow).

o Merge: merges two data sets (slow).

o Sort: sorts a data set (slow).
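Most DBMSs let you inspect the chosen access plan. A hedged Oracle-style sketch follows (the table, column, and query are hypothetical); SQL Server and MySQL expose similar functionality through their own showplan/EXPLAIN facilities.

-- Ask the optimizer for the access plan without running the query
EXPLAIN PLAN FOR
  SELECT cus_lname, cus_fname
    FROM customer
   WHERE cus_state = 'FL';

-- Display the plan; look for operations such as TABLE ACCESS (FULL)
-- or INDEX RANGE SCAN in the output
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);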

(Section 11.2.2) SQL Execution Phase

All I/O operations indicated in the access plan are executed. When the execution plan is run, the proper locks, if needed, are acquired, and the data are retrieved from the data files and placed in the DBMS's data cache. All transaction management commands are processed during the parsing and execution phases of query processing.

(Section 11.2.3) SQL Fetching Phase

After the parsing and execution phases are completed, all rows that match the specified condition are

retrieved, sorted, grouped, and/or aggregated (if required). During the fetching phase, the rows of the

resulting query result set are returned to the client.

(Section 11.2.4) Query processing bottlenecks

Query Processing Bottleneck: is a delay introduced in the processing of an I/O operation that causes the

overall system to slow down. There are five components that typically cause bottlenecks:

CPU

RAM

Hard Disk

Network

Application code

(Section 11.3) Indexes and Query optimization

Indexes are crucial in speeding up data access because they facilitate searching, sorting, using aggregate functions, and even join operations. Indexes contain an index key and pointers. When you use a book

index, you look up the word, similar to the index key, which is accompanied by the page numbers, similar

to the pointer, which direct you to the appropriate page. There are accesses to the index and accesses to the

data.

Why not index every column in every table? It is not practical to do so. Indexing every column in every table taxes the DBMS too much in terms of index-maintenance processing, especially if the table has many attributes, has many rows, or requires many inserts, updates, and deletes.

Data Sparsity refers to the number of different values a column could possibly have; it is a measure that helps determine the need for an index. For example, a Student_Sex column holds only Male or Female (M or F); this low sparsity indicates there is no need for an index. A Student_DOB column holds many different date values, which indicates high sparsity; in this case an index may be needed.

Most DBMSs implement indexes using one of the following data structures:
Hash indexes: A hash algorithm is used to create a hash value from a key column. Good for simple and fast lookup operations.
B-Tree indexes: The default and most common type of index used in dbs. They are used mainly in tables in which column values repeat a relatively small number of times. A B-tree is an ordered data structure organized as an upside-down tree. B-trees are "self-balanced," which means that it takes about the same number of accesses to find any given row in the index.
Bitmap indexes: Used in data warehouse applications on tables with a large number of rows in which a small number of column values repeat many times. They tend to use less space than B-tree indexes because they use bits instead of bytes to store their data.
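A brief sketch of how these index types are typically declared, using Oracle-style syntax and hypothetical table and column names (bitmap indexes are an Oracle feature; other DBMSs may only offer B-tree and hash structures):

-- Default B-tree index on a high-sparsity column
CREATE INDEX cus_lname_ndx ON customer (cus_lname);

-- Unique B-tree index (also created implicitly for PRIMARY KEY constraints)
CREATE UNIQUE INDEX cus_email_ndx ON customer (cus_email);

-- Bitmap index on a low-sparsity column in a data warehouse table
CREATE BITMAP INDEX sales_region_ndx ON sales_fact (region_code);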

(Section 11.4) Optimizer Choices

Query optimization is the central activity during the parsing phase in query processing. It can operate in

one of two modes:

A Rule-Based Optimizer: uses preset rules and points to determine the best approach to execute a query. The rules assign a "fixed cost" to each SQL operation.
A Cost-Based Optimizer: uses sophisticated algorithms based on statistics about the objects being accessed to determine the best approach to execute a query. In this case, the optimizer process adds up the processing cost, the I/O costs, and the resource costs (RAM and temporary space) to come up with the total cost of a given execution plan.

(Section 11.4.1) Using Hints to affect optimizer choices

If the statistics are old, the optimizer might not do a good job in selecting the best execution plan. There are

some occasions when the end user would like to change the optimizer mode for the current SQL statement.

In this case you need Optimizer Hints: which are special instructions for the optimizer that are embedded

inside the SQL command text.
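A hedged example of Oracle-style hints, which are embedded in the SQL text as special comments; hint names and syntax differ by vendor, and the index and table names here are hypothetical.

-- Force the use of a specific index
SELECT /*+ INDEX(customer cus_state_ndx) */ cus_lname, cus_state
  FROM customer
 WHERE cus_state = 'FL';

-- Ask the cost-based optimizer to minimize the time to return all rows
SELECT /*+ ALL_ROWS */ *
  FROM invoice
 WHERE inv_date > DATE '2010-01-01';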

(Section 11.5) Performance Tuning

Although a DBMS provides general optimizing services, a carefully written query almost always

outperforms a poorly written one.

(Section 11.5.1) Index Selectivity

Indexes are the most important technique used in SQL performance optimization. Indexes are likely to be

used:

When an indexed column appears by itself in the search criteria of the WHERE or HAVING clause.

When an indexed column appears by itself in a GROUP BY or ORDER BY clause.

When a MAX or MIN function is applied to an indexed column.

When the data sparsity on the indexed column is high.

The objective is to create indexes with high selectivity. Index selectivity is a measure of how likely an

index will be used in query processing.

Do not use indexes on small tables with minimal rows and columns. Also, do not create indexes on low-sparsity columns.

Declare PKs and FKs so the optimizer can use the indexes in join operations. The declaration of a PK or FK will automatically create an index for the declared column.

Declare indexes in join columns other than PK or FK.

Function Based Index: is an index based on a specific SQL function or expression. They are useful when

dealing with derived attributes.
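A short sketch of a function-based index, using Oracle-style syntax and hypothetical names; the index lets the optimizer satisfy searches on the derived expression directly.

-- Index on a derived attribute (total compensation)
CREATE INDEX emp_total_comp_ndx
    ON employee (emp_salary + emp_commission);

-- A query whose WHERE clause matches the indexed expression can use the index
SELECT emp_num, emp_lname
  FROM employee
 WHERE emp_salary + emp_commission > 100000;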

(Section 11.5.2) Conditional Expression

A conditional expression is normally expressed within the WHERE or HAVING clauses of a SQL

statement. Also known as conditional criteria, a conditional expression restricts the output of a query to

only the rows matching the conditional criteria.

Use simple columns or literals as operands in a conditional expression.

Numeric field comparisons are faster than character, date, and NULL comparisons. In search conditions, comparing a numeric attribute to a numeric literal is faster than comparing a character attribute to a character literal. In general, the CPU handles numeric comparisons (integer and decimal) faster than character and date comparisons. Because indexes do not store references to null values, null conditions involve additional processing and therefore tend to be the slowest of all conditional operands.

Equality comparisons are faster than inequality comparisons: the equality operator "=" is faster than the inequality operators "<", ">", ">=", "<=". "LIKE" is the slowest of all comparison operators.

Whenever possible, transform conditional expressions to use literals. Example: change P_PRICE - 10 = 7 to read P_PRICE = 17.

When using multiple conditional expressions, write the equality conditions first.

If you use multiple AND conditions, write the condition most likely to be false first.

When using multiple OR conditions, put the condition most likely to be true first.

Whenever possible, try to avoid the use of the NOT logical operator. Example: NOT (P_PRICE > 10.00) can be written as P_PRICE <= 10.00, and NOT (EMP_SEX = 'M') can be written as EMP_SEX = 'F'.
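A few of these guidelines illustrated on a hypothetical PRODUCT table; the rewritten forms return the same rows but give the optimizer (and any index on P_PRICE) a better chance of being used.

-- Avoid applying arithmetic to the indexed column
-- slower:  WHERE p_price - 10 = 7
SELECT p_code FROM product WHERE p_price = 17;

-- Avoid NOT by inverting the comparison
-- slower:  WHERE NOT (p_price > 10.00)
SELECT p_code FROM product WHERE p_price <= 10.00;

-- Put the equality condition before the inequality condition
SELECT p_code FROM product WHERE p_vendor = 21344 AND p_price > 10.00;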

(Section 11.7) DBMS Performance Tuning

Performance tuning includes global tasks such as managing the DBMS processes in primary memory

(allocating memory for caching purposes) and managing the structures in physical storage (allocating space

for the data files).

DBMS performance tuning at the server end focuses on setting the parameters used for:

Data Cache, the majority of primary memory resources will be allocated to the data cache.

SQL Cache, stores the most recently executed SQL statements (after the SQL statements have been parsed by the optimizer). The DBMS can parse a query once and execute it many times using the same access plan.

Sort Cache, The sort cache is used as a temporary storage area for ORDER BY or GROUP BY

ops, as well as for index-creation functions.

Optimizer mode, most DBMSs operate in one of two optimization modes: cost-based or rule-based. Others automatically determine the optimization mode based on whether db statistics are available. The DBA is responsible for generating the db statistics that are used by the cost-based optimizer. If the statistics are not available, the DBMS uses a rule-based optimizer.

Managing the physical storage details of the data files also plays an important role in DBMS performance

tuning.

Use RAID (redundant array of independent disks) to provide a balance between performance and fault tolerance. RAID uses multiple disks to create a single virtual disk formed by several individual disks, providing both performance improvement and fault tolerance.
Minimize disk contention: use multiple, independent storage volumes with independent spindles (a spindle is a rotating disk) to minimize hard disk cycles. The following table spaces are typically used:

o System table space: stores data dictionary

o User data table space: stores end-user data

o Index table space: stores indexes

o Temporary table space: this is used as a temporary storage area for merge, sort, or set

aggregate ops.

o Rollback segment table space: this is used for transaction recovery purposes.

Put high-usage tables in their own table spaces. By doing this, the db minimizes conflict with other tables.

Assign separate data files in separate storage volumes for the indexes, system, and high-usage tables. This ensures that index operations will not conflict with end-user data or data dictionary table access operations.

Take advantage of the various table storage organizations available in the db. For example, in

Oracle consider the use of index organized tables (IOT); in SQL Server consider clustered index

tables. An Index Organized Table or Clustered Index Table is a table that stores the end-user data and the index data in consecutive locations on permanent storage.

Partition tables based on usage.

Use denormalized tables where appropriate.

Store computed and aggregate (derived) attributes in tables to minimize computations in queries and join operations.
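A hedged Oracle-style sketch of separating storage: a dedicated table space is created on its own volume and a high-usage table is placed in it (the file path, size, and object names are hypothetical).

-- Dedicated table space on a separate storage volume
CREATE TABLESPACE sales_data
  DATAFILE '/disk02/oradata/sales_data01.dbf' SIZE 500M;

-- Place a high-usage table in its own table space
CREATE TABLE invoice (
  inv_num   NUMBER PRIMARY KEY,
  cus_code  NUMBER,
  inv_date  DATE
) TABLESPACE sales_data;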

----------------------------------------------------------------------------------------------------------------------------- ----

Chapter 12: Distributed database management system

(Section 12.1) The evolution of distributed database management systems

Distributed database management system (DDBMS): DDBMS governs the storage and processing of

logically related data over interconnected computer systems in which both data and processing functions

are distributed among several sites.

Rapid ad hoc data access became crucial in the quick-response decision-making environment.

Decentralization of management structures based on the decentralization of business units made

decentralized multiple-access and multiple location db a necessity.

The decentralized db is especially desirable because centralized db management is subject to problems such

as:

Performance degradation due to a growing number of remote locations over greater distances.

High costs associated with maintaining and operating large central mainframe db systems.

Reliability problems created by dependence on a central site (single point of failure syndrome) and

the need for data replication.

Scalability problems associated with the physical limits imposed by a single location (power,

temperature conditioning, and power consumption.)

Organizational rigidity imposed by the db might not support the flexibility and agility required by

modern global organizations.

(Section 12.2) DDBMS Advantages and Disadvantages

Advantages:

Data are located near the greatest demand site.

Faster data access

Faster data processing

Growth facilitation

Improved communications

Reduced operating costs

User friendly interface

Process independence

Disadvantages:

Complexity of management and control

Technological difficulty

Security

Lack of standards

Increased storage and infrastructure requirements

Increased training costs

Costs

(Section 12.3) Distributed processing and distributed databases

In distributed processing, a database's logical processing is shared among two or more physically independent sites that are connected through a network.
A Distributed Database stores a logically related database over two or more physically independent sites.
In a distributed database system, the db is composed of several parts known as database fragments, which are located at different sites and can be replicated among various sites.

Distributed processing does not require distributed database, but a distributed database requires distributed

processing.

Distributed processing may be based on a single db located on a single computer. For the management of distributed data to occur, copies or parts of the db processing functions must be distributed to all data storage sites.

Both distributed processing and distributed databases require a network to connect all components.

(Section 12.4) Characteristics of distributed database management systems

A DBMS must have at least the following functions to be classified as distributed:

App interface to interact with the end user, app programs, and other DBMSs within the distributed

db.

Validation to analyze data requests for syntax correctness

Transformation to decompose complex requests into atomic data request components

Query optimization to find the best access strategy. (Which db fragments must be accessed by the

query, and how must data update, if any, be synchronized?)

Mapping to determine the data location of local and remote fragments.

I/O interface to read or write data from or to permanent local storage.

Formatting to prepare the data for presentation to the end user or to an app program

Security to provide data privacy at both local and remote db.

Backup and recovery to ensure the availability and recoverability of the db in case of a failure.

DB admin features for the db admin.

Concurrency control to manage simultaneous data access and to ensure data consistency across

database fragments in the DDBMS.

Transaction management to ensure that the data moves from one consistent state to another. This

activity includes the synchronization of local and remote transactions as well as transactions across multiple distributed segments.

A fully distributed management system must perform all of the functions of a centralized DBMS, as

follows:

Receive an app's or an end user's request

Validate, analyze, and decompose the request. The request might include mathematical and/or

logical operations.

Map the request‘s logical to physical data components.

Decompose the request into several disk I/O operations.

Search for, locate, read, and validate the data.

Ensure database consistency, security, and integrity.

Validate the data for the conditions, if any, specified by the request.

Present the selected data in the required format.

(Section 12.5) DDBMS components

DDBMS must include at least the following components

Computer workstations

Network hardware and software components that reside in each workstation.

Communications media that carry the data from one workstation to another.

The transaction processor (TP), which is the software component found in each computer that

requests data. The transaction processor receives and processes the application data requests

(remote and local). The TP is also known as the application processor (AP) or the transaction

manager (TM).

The data processor (DP), which is the software component residing on each computer that stores

and retrieves data located at the site. The DP is also known as the data manager (DM). A data

processor may even be a centralized DBMS.

The protocols determine how the distributed database system will:

Interface with the network to transport data and commands between data processors (DPs) and

transaction processors (TPs).

Synchronize all data received from DP (TP side) and route retrieved data to the appropriate TPs

(DP side).

Ensure common database functions in a distributed system. Such functions include security,

concurrency control, backup, and recovery.

(Section 12.6) Levels of data and process distribution

Single-site process, single-site data: host DBMS (mainframe).
Single-site process, multiple-site data: not applicable (requires multiple processes).
Multiple-site process, single-site data: file server; client/server DBMS (LAN DBMS).
Multiple-site process, multiple-site data: fully distributed client/server DDBMS.

(Section 12.6.1) Single-Site processing, Single-Site Data (SPSD)

In the single-site processing, single-site data (SPSD) scenario, all processing is done on a single host

computer (single-processor server, multiprocessor server, mainframe system) and all data are stored on the

host computer's local disk system.

(Section 12.6.2) Multiple-Site processing, Single-Site Data (MPSD)

In the multiple-site processing, single-site data (MPSD) scenario, multiple processes run on different computers sharing a single data repository. Typically, the MPSD scenario requires a network file server running conventional applications that are accessed through a network. MPSD offers limited capabilities for distributed processing: all data selection, search, and update functions take place at the workstation, thus requiring that entire files travel through the network for processing at the workstation.
A variation of the MPSD approach is known as client/server architecture. Client/server architecture is similar to that of the network file server except that all db processing is done at the server site, thus reducing network traffic.

(Section 12.6.3) Multiple Site Processing, Multiple Site Data (MPMD)

Multiple site processing, multiple site data (MPMD) scenario describes a fully distributed DBMS with

support for multiple data processors and transaction processors at multiple sites.

Homogeneous DDBMS integrate only one type of centralized DBMS over a network.

Heterogeneous DDBMS integrate different types of centralized DBMS over a network.

Fully Heterogeneous DDBMS will support different DBMS that may even support different data models

(relational, hierarchical, or network) running under different computer systems, such as mainframes and

PCs.

(Section 12.7) Distributed Database Transparency Features

The DDBMS transparency features are:
Distribution Transparency, which allows a distributed db to be treated as a single logical db. If a DDBMS exhibits distribution transparency, the user does not need to know:

That the data are partitioned

That the data can be replicated at several sites.

The data location.

Transaction transparency, which allows a transaction to update data at more than one network

site. It ensures that the transaction will be either entirely completed or aborted to maintain db

integrity.

Failure transparency, which ensures that the system will continue to operate in the event of a

node failure.

Performance transparency, which allows the system to perform as if it were a centralized

DBMS. It also ensures that the system will find the most cost effective path to access remote data.

Heterogeneity Transparency, which allows the integration of several different local DBMS

(relational, network, and hierarchical) under a common, or global, schema. The DDBMS is

responsible for translating the data requests from the global schema to the local DBMS schema.

(Section 12.8) Distribution Transparency

Distribution transparency allows a physically dispersed db to be managed as though it were a centralized db. There are three levels of distribution transparency:

Fragmentation transparency is the highest level of transparency. The end user or programmer

does not need to know that a db is partitioned.

Location transparency exists when the end user or programmer must specify the db fragment

names but does not need to specify where those fragments are located.

Local Mapping transparency exists when the end user or programmer must specify both the

fragment names and their locations.

Unique Fragment: Indicates that each row is unique, regardless of the fragment in which it is located.

Distributed Data Dictionary (DDD) or Distributed Data catalog (DDC): DDD or DDC supports

distribution transparency. The DDD contains the description of the entire database as seen by the db admin.

It is distributed and replicated at the network nodes.

Distributed global schema: the common db schema used by local TPs to translate user requests into sub-queries (remote requests) that will be processed by different DPs.

(Section 12.9) Transaction Transparency

Transaction transparency is a DDBMS property that ensures that db transactions will maintain the

distributed db integrity and consistency. Transaction transparency ensures that the transaction will be

completed only when all db sites involved in the transaction complete their part of the transaction.

(Section 12.9.1) Distributed Requests and Distributed Transactions

Remote Request: Lets a single SQL statement access the data that are to be processed by a single remote

db processor. In other words, the SQL statement or request can reference data at only one remote site.

Remote Transaction: composed of several requests, it accesses data at a single remote site.
Distributed Transaction: allows a transaction to reference several different local or remote DP sites.

Distributed Request lets a single SQL statement reference data located at several different local or remote

DP sites.
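As an illustration, Oracle-style database links make a remote site visible to SQL; the link and table names below are hypothetical, and a true distributed request assumes the DDBMS can join data across sites transparently.

-- Remote request: one SQL statement, one remote DP site
SELECT * FROM customer@site_b WHERE cus_state = 'FL';

-- Distributed request: one SQL statement referencing several sites
SELECT c.cus_lname, i.inv_total
  FROM customer@site_b c
  JOIN invoice@site_c i ON i.cus_code = c.cus_code;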

(Section 12.9.2) Distributed Concurrency Control

TP component of a DDBMS must ensure that all parts of the transaction are completed at all sites before a

final COMMIT is issued to record the transaction.

(Section 12.9.3) Two-phase Commit Protocol

A centralized db requires only one DP. In a distributed db environment, a final COMMIT must not be issued until all sites have committed their parts of the transaction.

Two-phase commit protocol guarantees that if a portion of a transaction operation cannot be committed,

all changes made at the other sites participating in the transaction will be undone to maintain a consistent

database state. Each DP maintains its own transaction log. The two-phase commit protocol requires that the

transaction entry log for each DP be written before the db fragment is actually updated. Therefore the two-

phase commit protocol requires a DO-UNDO-REDO and a write-ahead protocol.

DO-UNDO-REDO protocol is used by the DP to roll back and / or roll forward transactions with the help

of the systems transaction log entries. This protocol defines three types of operations.

DO performs the operation and records the "before" and "after" values in the transaction log.

UNDO reverses an operation, using the log entries written by the DO portion of the sequence.

REDO redoes an operation, using the log entries written by the DO portion of the sequence.

To ensure that DO, UNDO, and REDO operations can survive a system crash while they are being

executed, a write-ahead protocol is used.

Write-ahead protocol forces the log entry to be written to permanent storage before the actual operation

takes place.

Two-phase commit protocol defines the operations between two types of nodes: Coordinator and one or

more subordinates, or Cohorts. The participating nodes agree on a coordinator.

Phase 1: Preparation The coordinator sends a PREPARE TO COMMIT message to all subordinates.

The subordinates receive the message; write the transaction log, using the write-ahead protocol;

and send an acknowledgement (YES/PREPARED TO COMMIT OR NO/NOT PREPARED)

message to the coordinator.

The coordinator makes sure that all nodes are ready to commit, or it aborts the action.

Phase 2: The final COMMIT

The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.

Each subordinate receives the COMMIT message and then updates the db using the DO protocol.

The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.

(Section 12.10) Performance transparency and query optimization

The objective of the query optimization routine is to minimize the total cost associated with the execution

of a request.

Replica transparency refers to the DDBMS ability to hide the existence of multiple copies of data from

the user.

Automatic query optimization means that the DDBMS finds the most cost-effective access path without

user intervention. It is the most desirable approach from the end user's point of view.

Manual query optimization requires that the optimization be selected and scheduled by the end user or

programmer.

Query optimization algorithms can be classified as static or dynamic.

Static query optimization takes place at compilation time.

Dynamic Query optimization takes place at execution time.

Statistically based query optimization algorithm uses statistical information about the db. The statistics

provide info about db characteristics such as size, number of records, average access time, number of

requests serviced, and number of users with access rights. These statistics are then used by the DBMS to

determine the best access strategy.

In dynamic statistical generation mode, the DDBMS automatically evaluates and updates the statistics after each access.
In manual statistical generation mode, the statistics must be updated periodically through a user-selected utility, such as the RUNSTATS command used by IBM's DB2 DBMS.
A rule-based query optimization algorithm is based on a set of user-defined rules to determine the best query access strategy.

(Section 12.11) Distributed database design
The design of a distributed db must decide:
How to partition the db into fragments

Which fragments to replicate

Where to locate those fragments and replicas

(Section 12.11.1) Data Fragmentation

Data fragmentation allows you to break a single object into two or more segments or fragments. There are

three types:

Horizontal fragmentation refers to the division of a relation into subsets (fragments) of tuples.

Each fragment is stored at a different node.

Vertical fragmentation refers to the division of a relation into attribute subsets. Each subset

(fragment) is stored at a different node.

Mixed fragmentation refers to a combination of horizontal and vertical strategies.
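A minimal sketch of how horizontal and vertical fragments of a hypothetical CUSTOMER table could be defined at each node; in practice the DDBMS (or a partitioning feature) manages this, but the fragments are logically equivalent to the following.

-- Horizontal fragment stored at the Tennessee node: a subset of the rows
CREATE TABLE customer_tn AS
  SELECT * FROM customer WHERE cus_state = 'TN';

-- Vertical fragment stored at the billing node: a subset of the columns
CREATE TABLE customer_billing AS
  SELECT cus_num, cus_balance, cus_credit_limit FROM customer;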

(Section 12.11.2) Data Replication

Data replication refers to the storage of data copies at multiple sites served by a computer network.

Mutual consistency rule requires that all copies of data fragments be identical.

Three replication scenarios exist:

Fully replicated database: stores multiple copies of each database fragment at multiple sites. It can be impractical due to the amount of overhead it imposes on the system.

Partially replicated database stores multiple copies of some database fragments at multiple sites.

Most DDBMS are able to handle this.

Un-replicated database stores each database fragment at a single site. Therefore there are no

duplicate db fragments.

Several factors influence the decision to use data replication:

Database size

Usage frequency

Costs

(Section 12.11.3) Data allocation

Data allocation describes the process of deciding where to locate data. Data allocation strategies are as follows:

With centralized data allocation, the entire db is stored at one site.
With partitioned data allocation, the db is divided into two or more disjoint parts (fragments) and stored at two or more sites.
With replicated data allocation, copies of one or more db fragments are stored at several sites.

----------------------------------------------------------------------------------------------------------------------- ----------

Chapter 13: Business Intelligence and Data Warehouses

(Section 13.1) The need for data analysis

Data analysis can provide info about short-term tactical evaluations and strategies such as: Are our sales promotions working? What market percentage are we controlling? Are we attracting new customers? It is a method used to gain competitive advantage and is essential for effective decision making to stay competitive. This more comprehensive and integrated decision support framework within organizations became known as business intelligence.

(Section 13.2) Business Intelligence

Business Intelligence (BI)