christoph f. eick introduction data management this week 1. introduction to databases 2. course...

25
Christoph F. Eick Introduction Data Management This Week 1.Introduction to Databases 2.Course Information 3.Grading and Other Things 4. Questionnaire 5. The E/R Data Model

Upload: patricia-hampton

Post on 03-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

This Week

1. Introduction to Databases

2. Course Information

3. Grading and Other Things

4. Questionnaire

5. The E/R Data Model

Page 2: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data ManagementTextbooks for COSC 6340

Required Text: Raghu Ramakrishnan and Johannes Gehrke, Data Management Systems, McGraw Hill, Third Edition, 2002 (complication: the chapter numbers in the new edition are different!!)

Recommended: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufman Publishers, 2001, ISBN 1-55860-489-8 (4 chapters will be covered)

Other books with relevant material: Ramez Elmasri and Shamkant Navathe, Fundamentals of Database Systems, Third Edition Addison Wesley ISBN: 0-8053-1755-4

Page 3: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Lectures in COSC 6340 I: Basic Database Management Concepts --- Review of basic database concepts,

techniques, and languages (9 classes, Chapters 1-5, 8-11, and 16 of the textbook). III: Introduction to KDD and Data Warehousing centering on data warehouses,

OLAP, and data mining; moreover, more detailed coverage of querying and mining data streams and database clustering (5 classes; Chapters 1, 2, 6, and 7 of the Han/Kamber book & chapter 25 of our textbook and additional material)

III: Relational Database Design (2 classes, chapters 19) IV: Implementation of Relational Operators, Query Optimization, and

Physical Database Design (Chapters 12+14+20, 3-4 classes) V: Internet Databases and XML (1 class, chapter 27 of the textbook) VI: Other: discussion of home works and exams, student presentations; discussion

of course projects (3 classes)

Page 4: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Tentative Schedule COSC 6340 Jan. 18: Introduction to COSC 6340 Fast Review of Undergraduate Material (Jan. 20-Feb. 15)

Jan. 20: Entity-Relationship Data Model more detailed than textbook Jan. 25: Entity-Relationship Data Model Jan. 27: Relational Data Model Feb. 1: Mapping E/R Diagrams to Relations Feb. 8/10/15: Index & storage structures, B+-trees, and hashing, PDBD Feb. 15/17: Relational Algebra Feb. 22: Writing SQL Queries (somewhat short) Feb. 24: Leftovers / Review

March 1: Exam0 (Undergraduate Review Exam)

Page 5: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Tentative Schedule COSC 6340 Part2

II: KDD and Data Warehousing (approx. 5.5 classes) March 3: Introduction to KDD March 8: Similarity Assessment March 10: Clustering March 22: Association Rule Mining March 29: Spatial Databases March 31: Data Warehouses and OLAP April 8 (30 minutes): Spatial Data Mining

April 5+7+[8]: III: Relational Database Design (2.5 classes) April 14: Midterm Exam (30 minutes review on April 12, 2005) April 19+26+[28]: Student Presentations IV: Physical Database Design and Query Optimization (3 classes)

April 8(makeup): Implementation of Relational Operators April 8(makeup)/12: Introduction to Query Optimization April 12/19: Physical Database Design II

V: Internet Databases (1 class) April 21: Introduction to XML and Semi-Structured Data

April 28: Course Summary + Teaching Evaluation + Leftovers

Page 6: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Exams and Homeworks COSC 6340

Tu., March 1: Undergraduate Material Review Exam Th., April 14: Midterm Exam 3 graded homeworks: deadlines: Feb. 17, April 11, April 28 Ungraded Homeworks Final Exam: Tu/TH., May 10, 11a-1:30p Qualifying Exam Part2: Fr., May 13, 10-11:30a

Page 7: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Why are integrated databases popular?

Avoidance of uncontrolled redundancy Making knowledge accessible that would otherwise not be

accessible Standardization --- uniform representation of data facilitating

import and export Reduction of software development (though the availability of

data management systems)

IntegratedDatabase

BookkeepingDevice

Car Salesman

Page 8: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Popular Topics in Databases Efficient algorithms for data collections that reside on disks (or

which are distributed over multiple disk drives, multiple computers or over the internet).

Study of data models (knowledge representation, mappings, theoretical properties)

Algorithms to run a large number of transactions on a database in parallel; finding efficient implementation for queries that access large databases; database backup and recovery,…

Database design How to use database management systems as an application

programmer / end user. How to use database management systems as database

administrator How to implement database management systems Data summarization, knowledge discovery, and data mining Special purpose databases (genomic, geographical, internet,…)

Page 9: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Data Model

Data Model

Schema (definesa set of database

states)

Current Database State

is used to define

Page 10: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Schema for the Library Example using the E/R Data Model

Many-to-Many1-to-1 1-to Many Many-to-1

title

authorB#

when

phone

name

ssn

Check_out Person Book(0,35) (0,1)

Page 11: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Relational Schema for Library Example in SQL/92

CREATE TABLE Book (B# INTEGER, title CHAR(30), author CHAR(20), PRIMARY KEY (B#));

CREATE TABLE Person (ssn CHAR(9), name CHAR(30), phone INTEGER, PRIMARY KEY (ssn));

CREATE TABLE Checkout( book INTEGER, person CHAR(9), since DATE, PRIMARY KEY (book), FOREIGN KEY (book) REFERENCES Book, FOREIGN KEY (person) REFERENCES Person));

Page 12: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Referential Integrity in SQL/92

SQL/92 supports all 4 options on deletes and updates. Default is NO ACTION

(delete/update is rejected) CASCADE (also delete all tuples

that refer to deleted tuple) SET NULL / SET DEFAULT (sets

foreign key value of referencing tuple)

CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid,cid), FOREIGN KEY (sid) REFERENCES Students

ON DELETE CASCADEON UPDATE SET DEFAULT )

Page 13: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Example of an Internal Schemafor the Library Example

INTERNAL Schema Library12 references Library. Book is stored sequentially, index on B# using hashing, index on Author using hashing. Person is stored using hashing on ssn. Check_out is stored sequentially, index on since using B+-tree.

Page 14: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Example: Stored Database

Relation Book

(11, Y,W)(30, Z, B)

(20, Y,W)(51, C, B)

(1, C,W)

11151

Index on B#Block#= B# mod 10

0

1

2030

(101,…)Relation PersonBlock#= sss mod 10

0

1

(200,…)(500,…)

W,…

Index on Author

RelationCheckout

Index on since

Page 15: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Modern Relational DBMSModern Relational DBMS

Modern DBMS

Modern DBMS

Support for Web-Interfaces, XML, andData Exchange

Efficient Implementation of Queries (Query Optimization, Join & Selection & Indexing techniques)

Transaction Concepts; capability

of running manytransactions in

parallel; support forbackup and recovery.

Support for specialData-types: long fields,images, html-links, DNA-sequences,spatial information,…

Support for higher level user interfaces:graphical, natural language, form-based,…

Support for OLAP and Data Warehousing

Support for Data Mining

operations

Support for OO; capability to store operations

Supportfor data-driven computing

Page 16: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Disks and Files

DBMS stores information on (“hard”) disks. This has major implications for DBMS design!

READ: transfer data from disk to main memory (RAM). WRITE: transfer data from RAM to disk. Both are high-cost operations, relative to in-memory

operations, so must be planned carefully!

Page 17: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Remark: All reported disk performance/prize data are as of middle of 2003

Why Not Store Everything in Main Memory?

Costs too much. $100 will buy you either 512MB of RAM or 50GB of disk today --- that is disk storage 100 times cheaper (but it is approx. 10000 times slower).

Main memory is volatile. We want data to be saved between runs. (Obviously!)

Typical storage hierarchy: Main memory (RAM) for currently used data. Disk for the main database (secondary storage). Tapes for archiving older versions of the data

(tertiary storage).

Page 18: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Components of a Disk

Platters

The platters spin (say, 90rps).

Spindle

The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!).

Disk head

Arm movement

Arm assembly

Only one head reads/writes at any one time.

Tracks

Sector

Block size is a multiple of sector size (which is fixed).

Page 19: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Accessing a Disk Page

Time to access (read/write) a disk block: seek time (moving arms to position disk head on

track) rotational delay (waiting for block to rotate under

head) transfer time (actually moving data to/from disk

surface) Seek time and rotational delay dominate.

Seek time varies from about 1 to 20msec Rotational delay varies from 0 to 10msec Transfer rate is about 1msec per 32KB page

Page 20: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Review: The ACID properties

AA tomicity: All actions in the Xact happen, or none happen. CC onsistency: If each Xact is consistent, and the DB starts

consistent, it ends up consistent. II solation: Execution of one Xact is isolated from that of

other Xacts. D D urability: If a Xact commits, its effects persist.

The Recovery Manager guarantees Atomicity & Durability.

Page 21: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Example

Consider two transactions (Xacts):

T1: BEGIN A=A+100, B=B-100 ENDT2: BEGIN A=1.06*A, B=1.06*B END

Intuitively, the first transaction is transferring $100 from B’s account to A’s account. The second is crediting both accounts with a 6% interest payment.

There is no guarantee that T1 will execute before T2 or vice-versa, if both are submitted together. However, the net effect must be equivalent to these two transactions running serially in some order.

Page 22: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Atomicity of Transactions

A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions.

A very important property guaranteed by the DBMS for all transactions is that they are atomic.

DBMS logs all actions so that it can undo the actions of aborted transactions and redo the actions of successful transactions.

Page 23: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Concurrency in a DBMS

Users submit transactions, and can think of each transaction as executing by itself. Concurrency is achieved by the DBMS, which interleaves

actions (reads/writes of DB objects) of various transactions. Each transaction must leave the database in a consistent

state if the DB is consistent when the transaction begins. DBMS will enforce some ICs, depending on the ICs

declared in CREATE TABLE statements. Beyond this, the DBMS does not really understand the

semantics of the data. (e.g., it does not understand how the interest on a bank account is computed).

Issues: Effect of interleaving transactions, and crashes.

Page 24: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Example (Contd.)

Consider a possible interleaving (schedule):

T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B

This is OK. But what about:T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B

The DBMS’s view of the second schedule:

T1: R(A), W(A), R(B), W(B)T2: R(A), W(A), R(B), W(B)

Page 25: Christoph F. Eick Introduction Data Management This Week 1. Introduction to Databases 2. Course Information 3. Grading and Other Things 4. Questionnaire

Christoph F. Eick Introduction Data Management

Summary

Concurrency control and recovery are among the most important functions provided by a DBMS.

Users need not worry about concurrency. System automatically inserts lock/unlock requests and

schedules actions of different transactions in such a way as to ensure that the resulting execution is equivalent to executing the transactions one after the other in some order.

Write-ahead logging (WAL) is used to undo the actions of aborted transactions and to restore the system to a consistent state after a crash.