ralated terms of dbms

26
DBMS

Upload: profhitesh-mohapatra

Post on 18-Nov-2014

112 views

Category:

Documents


0 download

DESCRIPTION

some concept about DBMS

TRANSCRIPT

Page 1: Ralated Terms of DBMS

DBMS

Page 2: Ralated Terms of DBMS

Organization of Records in FilesThere are several ways of organizing records in files.

heap file organization. Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records.

sequential file organization. Records are stored in sequential order, based on the value of the search key of each record.

hashing file organization. A hash function is computed on some attribute of each record. The result of the function specifies in which block of the file the record should be placed -- to be discussed in chapter 11 since it is closely related to the indexing structure.

clustering file organization. Records of several different relations can be stored in the same file. Related records of the different relations are stored on the same block so that one I/O operation fetches related records from all the relations.

Sequential File Organization1. A sequential file is designed for efficient processing of records in sorted order

on some search key. o Records are chained together by pointers to permit fast retrieval in search

key order. o Pointer points to next record in order. o Records are stored physically in search key order (or as close to this as

possible). o This minimizes number of block accesses. o Figure 10.15 shows an example, with bname as the search key.

2. It is difficult to maintain physical sequential order as records are inserted and deleted.

o Deletion can be managed with the pointer chains. o Insertion poses problems if no space where new record should go. o If space, use it, else put new record in an overflow block. o Adjust pointers accordingly. o Figure 10.16 shows the previous example after an insertion. o Problem: we now have some records out of physical sequential order. o If very few records in overflow blocks, this will work well. o If order is lost, reorganize the file. o Reorganizations are expensive and done when system load is low.

3. If insertions rarely occur, we could keep the file in physically sorted order and reorganize when insertion occurs. In this case, the pointer fields are no longer required.

Clustering File Organization1. One relation per file, with fixed-length record, is good for small databases, which

also reduces the code size. 2. Many large-scale DB systems do not rely directly on the underlying operating

system for file management. One large OS file is allocated to DB system and all relations are stored in one file.

Page 3: Ralated Terms of DBMS

3. To efficiently execute queries involving , one may store the depositor tuple for each cname near the customer tuple for the corresponding cname, as shown in Figure 10.19.

4. This structure mixes together tuples from two relations, but allows for efficient processing of the join.

5. If the customer has many accounts which cannot fit in one block, the remaining records appear on nearby blocks. This file structure, called clustering, allows us to read many of the required records using one block read.

6. Our use of clustering enhances the processing of a particular join but may result in slow processing of other types of queries, such as selection on customer.

For example, the query

aaaaaaaaaaaa¯select * from customer

now requires more block accesses as our customer relation is now interspersed with the deposit relation.

7. Thus it is a trade-off, depending on the types of query that the database designer believes to be most frequent. Careful use of clustering may produce significant performance gain.

Data Dictionary Storage1. The database also needs to store information about the relations, known as the

data dictionary. This includes: o Names of relations. o Names of attributes of relations. o Domains and lengths of attributes. o Names and definitions of views. o Integrity constraints (e.g., key constraints).

plus data on the system users:

o Names of authorized users. o Accounting information about users.

plus (possibly) statistical and descriptive data:

o Number of tuples in each relation. o Method of storage used for each relation (e.g., clustered or non-clustered).

2. When we look at indices (Chapter 11), we'll also see a need to store information about each index on each relation:

o Name of the index. o Name of the relation being indexed. o Attributes the index is on. o Type of index.

Page 4: Ralated Terms of DBMS

3. This information is, in itself, a miniature database. We can use the database to store data about itself, simplifying the overall structure of the system, and allowing the full power of the database to be used to permit fast access to system data.

4. The exact choice of how to represent system data using relations must be made by the system designer. One possible representation follows.

5. aaaaaaaaaaaa¯System-catalog-schema = (relation-name, number-attrs)

6.

7. Attr-schema = (attr-name, rel-name, domain-type, position, length)

8.

9. User-schema = (user-name, encrypted-password, group)

10.

11. Index-schema = (index-name, rel-name, index-type, index-attr)

12.

13. View-schema = (view-name, definition)

2.4 Database Systems: Mapping Constraints

An E-R scheme may de ne certain constraints to which the contents of a database must conform.

Mapping Cardinalities: express the number of entities to which another entity can be

associated via a relationship. For binary relationship sets between entity sets A and B, the mapping

cardinality must be one of:

1. One-to-one: An entity in A is associated with at most one entity in B, and an entity in B is

associated with at most one entity in A. (Figure 2.3)

2. One-to-many: An entity in A is associated with any number in B. An entity in B is associated with

at most one entity in A. (Figure 2.4)

3. Many-to-one: An entity in A is associated with at most one entity in B. An entity in B is associated

with any number in A. (Figure 2.5)

4. Many-to-many: Entities in A and B are associated with any number from each other. (Figure 2.6)

The appropriate mapping cardinality for a particular relationship set depends on the real world being

modeled.

(Think about the Cust Acct relationship...)

Page 5: Ralated Terms of DBMS

Existence Dependencies: if the existence of entity X depends on the existence of entity Y,

then X is said to be existence dependent on Y. (Or we say that Y is the dominant entity and X is

the subordinate entity.)

For example,

- Consider account and transaction entity sets, and a relationship log between them.

- This is one-to-many from account to transaction.

- If an account entity is deleted, its associated transaction entities must also be deleted.

- Thus account is dominant and transaction is subordinate.

Database Systems: Overview of Physical Storage Media

1. Several types of data storage exist in most computer systems. They vary in speed of access, cost

per unit of data, and reliability.

Cache: most costly and fastest form of storage. Usually very small, and managed by the

operating system.

Main Memory (MM): the storage area for data available to be operated on.

-General-purpose machine instructions operate on main memory.

-Contents of main memory are usually lost in a power failure or "crash".

-Usually too small (even with megabytes) and too expensive to store the entire database.

Flash memory: EEPROM (electrically erasable programmable read-only memory).

-Data in flash memory survive from power failure.

-Reading data from flash memory takes about 10 nano-secs (roughly as fast as from main

memory),and writing data into flash memory is more complicated: write-once takes about 4-10

microsecs.

-To overwrite what has been written, one has to rst erase the entire bank of the memory. It may

support only a limited number of erase cycles (104 to 106).

-It has found its popularity as a replacement for disks for storing small volumes of data (5-10

megabytes).

Magnetic-disk storage: primary medium for long-term storage.

-Typically the entire database is stored on disk.

-Data must be moved from disk to main memory in order for the data to be operated on.

-After operations are performed, data must be copied back to disk if any changes were made.

-Disk storage is called direct access storage as it is possible to read data on the disk in any order

(unlike sequential access).

Page 6: Ralated Terms of DBMS

-Disk storage usually survives power failures and system crashes.

Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many)

disk (for archival storage of data), and Juke box (containing a few drives and numerous disks loaded

on demand).

Tape Storage: used primarily for backup and archival data.

-Cheaper, but much slower access, since tape must be read sequentially from the beginning.

-Used as protection from disk failures!

2. The storage device hierarchy is presented in Figure 10.1, where the higher levels are expensive

(cost per bit),fast (access time), but the capacity is smaller.

3. Another classi cation: Primary, secondary, and tertiary storage.

(a) Primary storage: the fastest storage media, such as cash and main memory.(b) Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks.(c) Tertiary (or o -line) storage: magnetic tapes and optical disk juke boxes.

Page 7: Ralated Terms of DBMS

4. Volatility of storage. Volatile storage loses its contents when the power is removed. Without power

backup,data in the volatile storage (the part of the hierarchy from main memory up) must be written to

nonvolatile storage for safekeeping.

Magnetic Disk

Database Systems: Physical Characteristics of Disks

1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical commercial database

may require hundreds of disks.

2. Figure 10.2 shows a moving-head disk mechanism.

Each disk platter has a at circular shape. Its two surfaces are covered with a magnetic material and

information is recorded on the surfaces. The platter of hard disks are made from rigid metal or

glass,while oppy disks are made from exible material.

The disk surface is logically divided into tracks, which are subdivided into sectors. A sector (varying

from 32 bytes to 4096 bytes, usually 512 bytes) is the smallest unit of information that can be read

from or written to disk. There are 4-32 sectors per track and 20-1500 tracks per disk surface.

The arm can be positioned over any one of the tracks.

The platter is spun at high speed.

To read information, the arm is positioned over the correct track.

When the data to be accessed passes under the head, the read or write operation is performed.

3. A disk typically contains multiple platters (see Figure 10.2). The read-write heads of all the tracks

are mounted on a single assembly called a disk arm, and move together.

Multiple disk arms are moved as a unit by the actuator.

Each arm has two heads, to read disks above and below it.

The set of tracks over which the heads are located forms a cylinder.

This cylinder holds that data that is accessible within the disk latency time.

It is clearly sensible to store related data in the same or adjacent cylinders.

4. Disk platters range from 1.8" to 14" in diameter, and 5"1/4 and 3"1/2 disks dominate due to the

lower cost and faster seek time than do larger disks, yet they provide high storage capacity.

Page 8: Ralated Terms of DBMS

5. A disk controller interfaces between the computer system and the actual hardware of the disk drive.

It accepts commands to r/w a sector, and initiate actions. Disk controllers also attach checksums to

each sector to check read error.

6. Remapping of bad sectors: If a controller detects that a sector is damaged when the disk is initially

formatted,or when an attempt is made to write the sector, it can logically map the sector to a di erent

physical location.

7. SCSI (Small Computer System Interconnect) is commonly used to connect disks to PCs and

workstations.Mainframe and server systems usually have a faster and more expensive bus to connect

to the disks.

8. Head crash: why cause the entire disk failing (?).

9. A fixed dead disk has a separate head for each track | very many heads, very expensive. Multiple

disk arms:allow more than one track to be accessed at a time. Both were used in high performance

mainframe systems but are relatively rare today.

Lecture Notes: Relational AlgebraDet finns inget kapitel om relationsalgebra i kursen. Jag hade först tänkt ha med ett, men relationsalgebra passar inte riktigt i en grundkurs som den här. I stället finns en kort förklaring i ordlistan, och för den som vill läsa mer finns dessutom dessa föreläsningsanteckningar på engelska.

What? Why? Similar to normal algebra (as in 2+3*x-y), except we use relations as values

instead of numbers, and the operations and operators are different. Not used as a query language in actual DBMSs. (SQL instead.) The inner, lower-level operations of a relational DBMS are, or are similar to,

relational algebra operations. We need to know about relational algebra to understand query execution and optimization in a relational DBMS.

Some advanced SQL queries requires explicit relational algebra operations, most commonly outer join.

Relations are seen as sets of tuples, which means that no duplicates are allowed. SQL behaves differently in some cases. Remember the SQL keyword distinct.

SQL is declarative, which means that you tell the DBMS what you want, but not how it is to be calculated. A C++ or Java program is procedural, which means that you have to state, step by step, exactly how the result should be calculated. Relational algebra is (more) procedural than SQL. (Actually, relational algebra is mathematical expressions.)

Page 9: Ralated Terms of DBMS

Set operationsRelations in relational algebra are seen as sets of tuples, so we can use basic set operations.

Review of concepts and operations from set theory set element no duplicate elements (but: multiset = bag) no order among the elements (but: ordered set) subset proper subset (with fewer elements) superset union intersection set difference cartesian product

ProjectionExample: The table E (for EMPLOYEE)

nr name salary

1 John 100

5 Sarah 300

7 Tom 100

SQL Result Relational algebra

select salaryfrom E

salary

100

300

PROJECTsalary(E)

select nr, salaryfrom E

nr salary

1 100

5 300

7 100

PROJECTnr, salary(E)

Note that there are no duplicate rows in the result.

SelectionThe same table E (for EMPLOYEE) as above.

SQL Result Relational algebra

select *from Ewhere salary < 200

nr name salary

1 John 100

7 Tom 100

SELECTsalary < 200(E)

Page 10: Ralated Terms of DBMS

select *from Ewhere salary < 200and nr >= 7

nr name salary

7 Tom 100SELECTsalary < 200 and nr >= 7(E)

Note that the select operation in relational algebra has nothing to do with the SQL keyword select. Selection in relational algebra returns those tuples in a relation that fulfil a condition, while the SQL keyword select means "here comes an SQL statement".

Relational algebra expressions

SQL Result Relational algebra

select name, salaryfrom Ewhere salary < 200

name salary

John 100

Tom 100

PROJECTname, salary (SELECTsalary < 200(E)) or, step by step, using an intermediate result

Temp <- SELECTsalary < 200(E) Result <- PROJECTname, salary(Temp)

NotationThe operations have their own symbols. The symbols are hard to write in HTML that works with all browsers, so I'm writing PROJECT etc here. The real symbols:

Operation My HTML Symbol

Projection PROJECT

Selection SELECT

Renaming RENAME

Union UNION

Intersection INTERSECTION

Assignment <-

    Operation My HTML Symbol

Cartesian product

X

Join JOIN

Left outer joinLEFT OUTER

JOIN

Right outer join

RIGHT OUTER JOIN

Full outer joinFULL OUTER

JOIN

Semijoin SEMIJOIN

Example: The relational algebra expression which I would here write as

PROJECTNamn ( SELECTMedlemsnummer < 3 ( Medlem ) )

should actually be written

Page 11: Ralated Terms of DBMS

Cartesian productThe cartesian product of two tables combines each row in one table with each row in the other table.

Example: The table E (for EMPLOYEE)

enr ename dept

1 Bill A

2 Sarah C

3 John A

Example: The table D (for DEPARTMENT)

dnr dname

A Marketing

B Sales

C Legal

SQL Result Relational algebra

select *from E, D

enr ename dept dnr dname

1 Bill A A Marketing

1 Bill A B Sales

1 Bill A C Legal

2 Sarah C A Marketing

2 Sarah C B Sales

2 Sarah C C Legal

3 John A A Marketing

3 John A B Sales

3 John A C Legal

E X D

Seldom useful in practice. Usually an error. Can give a huge result.

Join (sometimes called "inner join")The cartesian product example above combined each employee with each department. If we only keep those lines where the dept attribute for the employee is equal to the dnr (the department number) of the department, we get a nice list of the employees, and the department that each employee works for:

SQL Result Relational algebraselect *from E, D enr ename dept dnr dname SELECTdept = dnr (E X D)

or, using the equivalent join

Page 12: Ralated Terms of DBMS

where dept = dnr

1 Bill A A Marketing

2 Sarah C C Legal

3 John A A Marketing

operation

E JOINdept = dnr D

A very common and useful operation. Equivalent to a cartesian product followed by a select. Inside a relational DBMS, it is usually much more efficient to calculate a join

directly, instead of calculating a cartesian product and then throwing away most of the lines.

Note that the same SQL query can be translated to several different relational algebra expressions, which all give the same result.

If we assume that these relational algebra expressions are executed, inside a relational DBMS which uses relational algebra operations as its lower-level internal operations, different relational algebra expressions can take very different time (and memory) to execute.

Natural joinA normal inner join, but using the join condition that columns with the same names should be equal. Duplicate columns are removed.

Renaming tables and columnsExample: The table E (for EMPLOYEE)

nr name dept

1 Bill A

2 Sarah C

3 John A

Example: The table D (for DEPARTMENT)

nr name

A Marketing

B Sales

C Legal

We want to join these tables, but:

Several columns in the result will have the same name (nr and name). How do we express the join condition, when there are two columns called nr?

Solutions:

Rename the attributes, using the rename operator. Keep the names, and prefix them with the table name, as is done in SQL. (This is

somewhat unorthodox.)

Page 13: Ralated Terms of DBMS

SQL Result Relational algebraselect *from E as E(enr, ename, dept), D as D(dnr, dname)where dept = dnr

enr ename dept dnr dname

1 Bill A A Marketing

2 Sarah C C Legal

3 John A A Marketing

(RENAME(enr, ename, dept)(E)) JOINdept = dnr (RENAME(dnr, dname)

(D))

select *from E, Dwhere dept = D.nr

nr name dept nr name

1 Bill A A Marketing

2 Sarah C C Legal

3 John A A Marketing

E JOINdept = D.nr D

You can use another variant of the renaming operator to change the name of a table, for example to change the name of E to R. This is necessary when joining a table with itself (see below).

RENAMER(E) A third variant lets you rename both the table and the columns: RENAMER(enr, ename, dept)(E)

Aggregate functionsExample: The table E (for EMPLOYEE)

nr name salary dept

1 John 100 A

5 Sarah 300 C

7 Tom 100 A

12 Anne null C

SQL Result Relational algebra

select sum(salary)from E

sum

500Fsum(salary)(E)

Note:

Duplicates are not eliminated. Null values are ignored.

SQL Result Relational algebraselect count(salary)from E

Result: Fcount(salary)(E)

Page 14: Ralated Terms of DBMS

count

3

select count(distinct salary)from E

Result:

count

2Fcount(salary)(PROJECTsalary(E))

You can calculate aggregates "grouped by" something:

SQL Result Relational algebra

select sum(salary)from Egroup by dept

dept sum

A 200

C 300

deptFsum(salary)(E)

Several aggregates simultaneously:

SQL Result Relational algebra

select sum(salary), count(*)from Egroup by dept

dept sum count

A 200 2

C 300 1

deptFsum(salary), count(*)(E)

Standard aggregate functions: sum, count, avg, min, max

HierarchiesExample: The table E (for EMPLOYEE)

nr name mgr

1 Gretchen null

2 Bob 1

5 Anne 2

6 John 2

3 Hulda 1

4 Hjalmar 1

7 Usama 4

Going up in the hierarchy one level: What's the name of John's boss?

Page 15: Ralated Terms of DBMS

SQL Result Relational algebra

select b.namefrom E p, E bwhere p.mgr = b.nrand p.name = "John"

name

Bob

PROJECTbname ([SELECTpname = "John"(RENAMEP(pnr, pname, pmgr)(E))] JOINpmgr = bnr [RENAMEB(bnr, bname, bmgr)(E)])

or, in a less wide-spread notation

PROJECTb.name ([SELECTname = "John"(RENAMEP(E))] JOINp.mgr =

b.nr [RENAMEB(E)])

or, step by step

P <- RENAMEP(pnr, pname, pmgr)(E) B <- RENAMEB(bnr, bname, bmgr)(E) J <- SELECTname = "John"(P) C <- J JOINpmgr = bnr B R <- PROJECTbname(C)

Notes about renaming:

We are joining E with itself, both in the SQL query and in the relational algebra expression: it's like joining two tables with the same name and the same attribute names.

Therefore, some renaming is required. RENAMEP(E) JOIN... RENAMEB(E) is a start, but then we still have the same

attribute names.

Going up in the hierarchy two levels: What's the name of John's boss' boss?

SQL Result Relational algebra

select ob.namefrom E p, E b, E obwhere b.mgr = ob.nrwhere p.mgr = b.nrand p.name = "John"

name

Gretchen

PROJECTob.name (([SELECTname = "John"(RENAMEP(E))] JOINp.mgr = b.nr [RENAMEB(E)]) JOINb.mgr = ob.nr [RENAMEOB(E)])

or, step by step

P <- RENAMEP(pnr, pname, pmgr)(E) B <- RENAMEB(bnr, bname, bmgr)(E) OB <- RENAMEOB(obnr, obname, obmgr)(E) J <- SELECTname = "John"(P) C1 <- J JOINpmgr = bnr B C2 <- C1 JOINbmgr = bbnr OB R <- PROJECTobname(C2)

Recursive closureBoth one and two levels up: What's the name of John's boss, and of John's boss' boss?

SQL Result Relational algebra

Page 16: Ralated Terms of DBMS

(select b.name ...)union(select ob.name ...)

name

Bob

Gretchen

(...) UNION (...)

Recursively: What's the name of all John's bosses? (One, two, three, four or more levels.)

Not possible in (conventional) relational algebra, but a special operation called transitive closure has been proposed.

Not possible in (standard) SQL (SQL2), but in SQL3, and using SQL + a host language with loops or recursion.

Outer joinExample: The table E (for EMPLOYEE)

enr ename dept

1 Bill A

2 Sarah C

3 John A

Example: The table D (for DEPARTMENT)

dnr dname

A Marketing

B Sales

C Legal

List each employee together with the department he or she works at:

SQL ResultRelational

algebraselect *from E, Dwhere edept = dnr

or, using an explicit joinselect *from (E join D on edept = dnr)

enr ename dept dnr dname

1 Bill A A Marketing

2 Sarah C C Legal

3 John A A Marketing

E JOINedept = dnr D

No employee works at department B, Sales, so it is not present in the result. This is probably not a problem in this case. But what if we want to know the number of employees at each department?

Page 17: Ralated Terms of DBMS

SQL Result Relational algebraselect dnr, dname, count(*)from E, Dwhere edept = dnrgroup by dnr, dname

or, using an explicit joinselect dnr, dname, count(*)from (E join D on edept = dnr)group by dnr, dname

dnr dname count

A Marketing 2

C Legal 1

dnr, dnameFcount(*)(E JOINedept = dnr D)

No employee works at department B, Sales, so it is not present in the result. It disappeared already in the join, so the aggregate function never sees it. But what if we want it in the result, with the right number of employees (zero)?

Use a right outer join, which keeps all the rows from the right table. If a row can't be connected to any of the rows from the left table according to the join condition, null values are used:

SQL Result Relational algebra

select *from (E right outer join D on edept = dnr)

enr ename dept dnr dname

1 Bill A A Marketing

2 Sarah C C Legal

3 John A A Marketing

null null null B Sales

E RIGHT OUTER JOINedept = dnr D

select dnr, dname, count(*)from (E right outer join D on edept = dnr)group by dnr, dname

dnr dname count

A Marketing 2

B Sales 1

C Legal 1

dnr, dnameFcount(*)(E RIGHT OUTER JOINedept = dnr D)

select dnr, dname, count(enr)from (E right outer join D on edept = dnr)group by dnr, dname

dnr dname count

A Marketing 2

B Sales 0

C Legal 1

dnr, dnameFcount(enr)(E RIGHT OUTER JOINedept = dnr D)

Join types:

JOIN = "normal" join = inner join LEFT OUTER JOIN = left outer join RIGHT OUTER JOIN = right outer join FULL OUTER JOIN = full outer join

Page 18: Ralated Terms of DBMS

Outer unionOuter union can be used to calculate the union of two relations that are partially union compatible. Not very common.

Example: The table R

A B

1 2

3 4

Example: The table S

B C

4 5

6 7

The result of an outer union between R and S:

A B C

1 2 null

3 4 5

null 6 7

DivisionWho works on (at least) all the projects that Bob works on?

SemijoinA join where the result only contains the columns from one of the joined tables. Useful in distributed databases, so we don't have to send as much data over the network.

UpdateTo update a named relation, just give the variable a new value. To add all the rows in relation N to the relation R:

R <- R UNION N

Domain relational calculus

Page 19: Ralated Terms of DBMS

In computer science, domain relational calculus (DRC) is a calculus that was ntroduced by Michel Lacroix and Alain Pirotte as a declarative database query language for the relational data model

In DRC, queries have the form:

where each Xi is either a domain variable or constant, and p(<X1, X2, ...., Xn>) denotes a DRC formula. The result of the query is the set of tuples Xi to Xn which makes the DRC formula true.

This language uses the same operators as tuple calculus, the logical connectives ∧ (and), ∨ (or) and ¬ (not). The existential quantifier (∃) and the universal quantifier (∀) can be used to bind the variables.

Its computational expresivity is equivalent to that of Relational algebra.

Examples

Let (A, B, C) mean (Rank, Name, ID)

and (D, E, F) to mean (Name, DeptName, ID)

In this example, A, B, C denotes both the result set and a set in the table Enterprise.

Find Names of Enterprise crewmembers who are in Stellar Cartography:

In this example, we're only looking for the name, and that's B. F = C is a requirement, because we need to find Enterprise crew members AND they are in the Stellar Cartography Department.

An alternate representation of the previous example would be:

In this example, the value of the requested F domain is directly placed in the formula and the C domain variable is re-used in the query for the existence of a department, since it already holds a crew member's id.