dr. alexandra i. cristea acristea/ cs 319: theory of databases: generalities

56
Dr. Alexandra I. Cristea http://www.dcs.warwick.ac.uk/ ~acristea/ CS 319: Theory of Databases: Generalities

Upload: evelyn-mills

Post on 28-Mar-2015

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

Dr. Alexandra I. Cristea

http://www.dcs.warwick.ac.uk/~acristea/

CS 319: Theory of Databases: Generalities

Page 2: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

2

Lecturers

• Alexandra I. Cristea

• Hugh Darwen: [email protected]

• Other invited talks?: TBA

Page 3: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

3

Schedule• See website

• Exceptions:– TBA: check website, lectures, forum

Page 4: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

4

Slides• Thanks to:

– Dr. Paul Goldberg:• http://www.dcs.warwick.ac.uk/people/academic/Paul.Goldberg/cs319/

cs319index.html

– Dr. Meurig Beynon:• http://www.dcs.warwick.ac.uk/people/staff/Meurig.Beynon/

– Dr. Ad Aerts:• http://wwwis.win.tue.nl/~aaerts/

– Prof. Dr. Paul De Bra:• http://wwwis.win.tue.nl/~debra/

– Others: mentioned directly

Page 5: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

5

Contact

• Forum:

http://forums.warwick.ac.uk/wf/browse/category.jsp?cat=24 • IF (and ONLY IF) a question is personal, you

might address it to [email protected]

– FORMAT: subject of email should contain ‘CS319’ and topic of the email (otherwise it will be filtered out)

Page 6: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

6

Organisational

• Office hours, see website

• Per email is also possible – but availability depends on demand

• (general questions better to post on forum)

Page 7: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

7

Course site(s):• Current:

– http://www.dcs.warwick.ac.uk/~acristea/courses/CS319/ – Will contain current slides, as taught at the course– Will contain notifications: check BEFORE & AFTER the

course

• Official:– http://www.dcs.warwick.ac.uk/undergraduate/modules/cs319.html

• Past:– http://www.dcs.warwick.ac.uk/people/academic/Paul.Goldberg/

cs319/cs319index.html

Page 8: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

8

Content: Topics1. Generalities DB2. Integrity constraints (FD revisited)3. Relational Algebra (revisited)4. Query optimisation5. Tuple calculus6. Domain calculus7. Query equivalence8. Temporal Data9. The “Askew Wall”10. How to Handle Missing Information without Using

NULL

Page 9: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

9

Books

• Korth and Silberschatz, Database System Concepts, McGraw-Hill,1991.

• Ullman J D, Principles of Database Systems (Vols 1 and 2), Computer Science Press,1988.

Page 10: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

10

Purpose of this course

• More in-depth information on the theory of databases

• How (some of) the existing db languages fit in the theory (and how they don’t)

Page 11: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

11

Overlaps and sequencing

• Form: optional

• Prerequisites – CS252: Fundamentals of Database Systems– Optional: CS253: Topics in Database Systems

Page 12: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

12

Organization of the course• 15 CATS• CS, CSE, CBS, Mathematics• ~ 30 1-hour lectures• Exam at the end: 3 hours• Rules of the game:

– Read also comments on the slides.– Presence is optional, but beware: slides-only are

NOT ENOUGH to learn from for the exam; you need to participate, take your own notes, so self-study!

Page 13: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

13

Scope of CS 319

• Expressive power of db query languages• Algorithms for the computational problems • Limitations to classical relational theory

• A central contribution of theory is to say what cannot be done, not just what can.

• Consider how this observation is relevant to the above topics.

Page 14: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

14

A theme of CS319• How do theory and computing practice relate to

specific reference to databases?• Key ideas:• Theory: based on Codd's relational theory.

– There is an excellent correspondence between relational theory and practical database application of a certain kind.

• Relational databases can be seen as a precursor of 2 principal kinds of computer application: – environments for end-user programming and – computer-based models of real-world state.

Page 15: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

15

1. Database Generalities

Page 16: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

16

DB generalities : What is a database?

• Chris Date:– Database = computer-based record keeping system

• R.W. Engles: "A Tutorial on DB Organisation" (1974)– Db = collection of stored operational data used by

the applications system of a particular enterprise– enterprise: hospital, university, bank, company etc– operational data:

• data on products, accounts, patients etc• typically persistent cf conventional program

IO data

Page 17: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

17

DB generalities : Why use a db?• How to meet needs using a traditional file-processing

system supported by a conventional OS? • Files: permanent records of customers, accounts• Applications programs (APs): enable user to modify

files– to credit or debit an account– to add a new account– to find the balance in an account– to generate monthly statements

• APs written by systems programmers as required• new requirements new files + new programs

Page 18: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

18

Original context for data modelling 1970s style applications unsophisticated computer users batch mode interaction modest response times no visualisation or GUI modest expectations for ease-of-use

programming perceived as technical simple infrastructure and environment

no PC, web etc no live feeds of data textual interaction the norm

Page 19: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

19

Original context for data modelling • 1970s style applications

– Business context• simple business model, limited automation, access

etc• low volume of data• not initially distributed

– Computing context• existing/emerging DB proposals unconvincing• computers not very powerful• human and computing resources very expensive

Page 20: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

20

Summary of issues for data management

1. Data redundancy and inconsistency

2. Difficulty in accessing data

3. Data isolation

4. Concurrent access anomalies

5. Security problems

6. Integrity problems

Page 21: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

21

Data redundancy & inconsistency programmer uses ≠ format file, developed at

≠ stage in history of enterprise

=> data duplicated:

+ in a DB rationalise and standardise data

[rationalise: shared source for data]

… rationalise doesn't necessarily mean centralise

Page 22: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

22

– compromises are needed

• where users suit themselves => efficient results perfect data organisation to suit all users

– duplication: insurance against info loss

Data redundancy & inconsistency

Page 23: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

23

Difficulty in accessing data

• unforeseen requests, new functionality

• new programs, possibly new data structures

+ in a DB, simplify access & manipulation by intelligent organisation of data cf. modelling approach to requirements

e.g. in use of UML in OOSE

Page 24: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

24

Data isolation

data to retrieve from many sources in APs + in DB, hide source physical data :

higher level of abstraction

– automation: less human interaction with datarisk of corrupted data

Page 25: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

25

Concurrent access anomalies

• would like multiple access (efficiency & faster response)e.g. simultaneous withdrawal

+ concurrency can't be managed without a form of overall control

Page 26: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

26

Security problems

• to restrict access to (un)authorised users for confidential info

+ security needs a form of overall control

– issues: where should the control be? inside or outside computer system

* e.g., non-trivial problem to determine what can be inferred from responses to queries that aren't explicit

Page 27: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

27

Integrity problems

• integrity constraints– may arise dynamically: – difficult to modify programs to cope; – hard to guarantee if data stored in ≠ files

+ automated management demands a form of overall control

– automation reduces scope for human intervention / interpretation

Page 28: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

28

Conclusions: Issues for data management

• For many commercial applications, a good solution is offered by a database management system (DBMS).

• A DBMS is an unconventional OS operating over a structured file system.

Page 29: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

29

Motivating the DBMS concept

• abstract model of corpus of operational data that simplifies data processing activity:– simple queries can be handled without writing

new application programs– if APs need => accessing & manipulating

operational data consistently and efficiently is simplified

Page 30: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

30

The ingredients of a database

Dataintegratedshared(possibly distributed)

Hardwarestorage

Softwaredatabase management system: DBMSprotects users from hardware level detailserves the needs of all users

Page 31: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

31

DB Users• end-user:

– non-specialist accessing data via a query language

– naïve user accessing data via a special-purpose interface

performs data retrieval and update (extend / modify)

• applications programmer:– writes programs that use the DB by embedding queries to

the DB in a HLL

– develops interfaces for the naïve user

Page 32: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

32

DB Users Database Administrator (DBA):

responsible for overall control

decides what data is to be stored

designs the conceptual scheme

used to represent the operational data

implements authorisation checks

decides strategy for backup and recovery

monitors performance

oversees modification to suit user requirements

Page 33: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

33

Data abstraction in a db

• addresses issues of design, use, management and implementation

• Data model describes (formally) 3 different levels of abstraction:

physical level

conceptual level

view level

Page 34: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

34

3 Levelsphysical level:how is the data actually represented in the hardware?

bits, bytesconceptual level:what meaningful relationships are expressed by the

physical data?entities, and relationships between entities

view level:what particular relationships are required by users?

more abstract because partial typically very high-level knowledge constitutes the view

Page 35: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

35

Illustrating data abstraction:

DB w. date of birth of a client (bit string).

senior citizens?: clients aged > 65

Representations at ≠ abstraction levels:

Conceptual: date of birth

Physical: bit string

View: age (not stored in DB!!)

Page 36: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

36

Data abstraction in a database

views conceptual model

physical layout

external schemas

internal schemas

sub-schemes

conceptual scheme

physical scheme

USE

DESIGN&

MANAGEMENT IMPLEMENTATION

Page 37: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

37

Role of abstraction (1)

• Internal & external translation schemas:

* protect conceptual model from change

• when physical organisation changes / new views are required

* protect user from a need to change views

* protect altering physical organisation

• if conceptual model is modified

Page 38: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

38

Role of data abstraction (2)

• physical data independence:

protecting conceptual model from change when

physical organisation changes

• logical data independence:

protecting user from need to change views when

conceptual model changes

Page 39: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

39

Characteristics of electronic data 1970

• “Abstract model of the entire corpus of operational data”• Demands of the abstract model in 1970 quite low …

– small volumes of data, modest performance– limited levels of volatility and automation tolerated

• Today different, BUT – (subject to viewing human agency as a metaphor for any

agency, )– key issues of a classical database are still vital

• Any DB modelling paradigm must handle 70s problems

Page 40: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

40

db models

• 2 principal kinds of abstract data model:– object-based models– record-based models

• earliest• reflects file system culture they displaced

Page 41: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

41

Object-based models

• main models:– entity-relationship models– object-oriented data models

• Others: semantic & functional data models.

• E-R model widely used to model data abstractly

• OO model gaining acceptance in practice: effectively represents data + operations on it.– RDF, OWL

Page 42: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

42

Record-based Logical Models

• Used at the conceptual and view levels. Specify both– overall logical structure of the database– higher-level description of the implementation.

• Record-based: uses records in fixed-format of several types. simplifies implementation <> trend towards richness

and variety in structures used to implement OODBs

Page 43: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

43

Varieties of record-based logical model

hierarchical model

records & links organised as a family of trees

network model

records & links organised as a family of graphs

relational model

uses tables to record relationships between data

Page 44: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

44

Physical Data Models

• not our primary concern in this module.• Relevant issues here would be:

– are data tables stored using hashing?– how are data tables indexed?– how are entries in data tables encoded and ordered?– what algorithms are used to retrieve and update?

Page 45: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

45

Classical database features

Instances and SchemesState of DB changes over time: structure vs. current

state.

overall design of DB = database schemacurrent content of DB = instance of the DB

Useful analogy with procedural variables:database schema type definition for variableinstance of database value of the variable

Page 46: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

46

Data abstraction and schemas

• physical scheme at lowest level

• conceptual scheme at intermediate level

• several sub-schemes (possibly user-defined) at highest level (views of the DB)

Page 47: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

47

Data Definition Language (DDL)

• for database schema • compiling DDL > Data Dictionary• the storage & access methods: specified in storage &

definition languageImplementation details usually hidden from users

Page 48: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

48

Data Manipulation Language (DML)

• data manipulation: accessing DB to retrieve, insert, delete, or modify data

• data retrieval– most common – "querying the DB"

• retrieval component of DML = query language (abusively: ‘query language’ ~ synonym DML)

Page 49: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

49

Varieties of Data Manipulation Language

• tension between• efficiency at physical level• intelligibility / ease of use at higher level

• procedural: requires knowledge of data implementation

• non-procedural: need only specify what data is needed

Page 50: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

50

Data Manipulation Languages (DML) for typical data models

• Procedural: – object-based, hierarchical, network models – user explicit responsibility: optimising queries,

• needs knowledge of data organisation

• Non-procedural: – relational models– formulate queries without above,

• implementation has to be optimised

Page 51: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

51

Optimisation• Database Manager (program module!)

– interfaces between low-level data in DB and application programs & user queries.

• Large volumes of data (available technology)– gigabytes thousand megabytes = 1 billion bytes [!]– terabytes million megabytes = 1 trillion bytes

• Requires auxiliary storage media, such as disk, CD etc.

• Optimisation is primarily concerned with eliminating data transfers between main and auxiliary memory.

Page 52: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

52

Functions of the DB manager program module

• query processing– interacting with the file manager modules doing actual

operations on physical data• integrity enforcement

– checking that data in the DB conforms to specified constraints

• security enforcement– ensuring that authorisation is given for access to data

• backup and recovery– coping with failure, and recovery to consistent DB state

• concurrency control– ensuring that simultaneous transactions do not interfere.

Page 53: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

53

Overall DB system structure• Processing components

– file manager– database manager– query processor– DML precompiler (to process DML embedded in APs)– DDL compiler

• Data structures– data files: actual content of db– data dictionary meta-data– Indices: auxiliary files to assist fast access

Page 54: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

54

Summary

• We have revisited and reconsidered the generalities of database systems– Background of appearance– Main problems they try to solve– Components & languages

Page 55: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

55

… to follow

How to reason with Integrity Enforcements (Constraints):Functional Dependencies (FDs) applied

Page 56: Dr. Alexandra I. Cristea acristea/ CS 319: Theory of Databases: Generalities

56

• Questions?