Fall 2002 CSE330/CIS550 Handout 1

The Relational Model: Relational Algebra

Data Models and database design

• When we design a database we try to think “logically”, but need some kind of framework in which to design the database.

• It is like designing a data structure in some programming language. You might use arrays, lists, etc. depending on what is available. A data model is like a type system, but is abstract.

• In the relational data model we organize the data into tables. We don't (initially) worry about how these tables are implemented.

The Relational Model: An introduction

• In the first few lectures we are going to discuss relational query languages.

– We'll start by discussing the relational algebra, a “theoretical language”. Later we'll discuss -- and use -- the “commercial standard”, SQL.

– Limitations of the relational algebra will also be discussed by contrast with a logical language, Datalog.

• The “theoretical language” is also used as an internal language to implement and optimize SQL.

What is a relational db?

• As you probably guessed, it is a collection of tables.

Routes:
RId  RName        Grade  Rating  Height
1    Last Tango   2      12      100
2    Garden Path  1      2       60
3    The Sluice   1      8       60
4    Picnic       3      3       400

Climbers:
CId  CName    Skill  Age
123  Edmund   EXP    80
214  Arnold   BEG    25
313  Bridget  EXP    33
212  James    MED    27

Climbs:
CId  RId  Date      Duration
123  1    10/10/88  5
123  3    11/08/87  1
313  1    12/08/89  5
214  2    08/07/92  2
313  1    06/07/94  3

Why is the database like this?

• Each route has an id, a name, a grade (an estimate of the time needed), a rating (how difficult it is), and a height.

• Each climber has an id, a name, a skill level and an age.

• A climb records who climbed what route on what date and how long it took (duration).

• We will deal with how we arrive at such a design later. Right now observe that the data values in these tables are all “simple”. None of them are complex structures -- like other relations.

Some terminology

• The column names of a relation are often called attributes or fields. The number of these columns is called the arity of the relation.

• The rows of a relation are called tuples.

• Each attribute has values taken from a domain. For example, the domain of CName is string and that of Rating is int.

• A relation is a set of tuples; no tuple can occur more than once. Objects differ in that they have “identity”: two objects with the same field values can still be distinct.
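To make the terminology concrete, here is a minimal sketch that models the Climbers relation as a Python set of tuples. The representation (a tuple of attribute names plus a set of rows) and the `Climber` class are illustrative assumptions, not something the handout prescribes.

```python
# A relation modelled as attribute names plus a set of tuples (an assumption
# made for illustration; the handout does not fix a representation).
CLIMBERS_ATTRS = ("CId", "CName", "Skill", "Age")

climbers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}

# The arity is the number of attributes (columns).
arity = len(CLIMBERS_ATTRS)

# Adding a tuple that is already present changes nothing: a relation is a set.
climbers_again = climbers | {(212, "James", "MED", 27)}


class Climber:
    """Hypothetical object version of a climber, used only to show identity."""

    def __init__(self, cid, cname, skill, age):
        self.cid, self.cname, self.skill, self.age = cid, cname, skill, age


a = Climber(212, "James", "MED", 27)
b = Climber(212, "James", "MED", 27)
distinct_objects = a is not b  # same field values, yet two separate objects
```

The set absorbs the repeated tuple, whereas the two `Climber` objects stay distinct even though every field agrees.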

Describing Relations

• Relations are described by a schema, which can be expressed in various ways, but to a DBMS is usually expressed in a data definition language (DDL) -- something like a type system of a programming language.

Routes(RId:int, RName:string, Grade:int, Rating:int, Height:int)
Climbers(CId:int, CName:string, Skill:string, Age:int)
Climbs(CId:int, RId:int, Date:date, Duration:int)
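The "schema as type system" analogy can be sketched in a few lines. The mapping of int, string and date to Python's `int`, `str` and `datetime.date`, and the helper name `conforms`, are assumptions made for this example.

```python
import datetime

# The three schemas from the slide, written as attribute -> Python type.
# The choice of Python types for int/string/date is an assumption.
SCHEMAS = {
    "Routes": {"RId": int, "RName": str, "Grade": int, "Rating": int, "Height": int},
    "Climbers": {"CId": int, "CName": str, "Skill": str, "Age": int},
    "Climbs": {"CId": int, "RId": int, "Date": datetime.date, "Duration": int},
}


def conforms(relation_name, row):
    """Return True if row has the right arity and every value lies in its domain."""
    schema = SCHEMAS[relation_name]
    if len(row) != len(schema):
        return False
    return all(isinstance(value, domain) for value, domain in zip(row, schema.values()))


ok = conforms("Climbs", (123, 1, datetime.date(1988, 10, 10), 5))
bad = conforms("Climbers", (123, "Edmund", "EXP", "eighty"))  # Age is not an int
```

A DBMS performs this kind of check when a tuple is inserted; here it is only a hand-written approximation.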

A note on domains

• Relational DBMSs have fixed “built-in” domains, such as int, string, etc. They also offer some other domains, like date, but not, for example, roman-numeral (which might be useful here).

• In object-oriented and object-relational systems, new domains can be added by the programmer/user or bought from the vendor.

• Database people, when they are discussing design, often get sloppy and forget domains. They write, for example, Routes(RId, RName, Grade, Rating, Height)

Integrity Constraints

• Domains are, in a sense, a primitive form of constraint on a valid instance of the schema. Other important constraints include:

– Key constraints: each tuple must be distinct. A key is a subset of fields that uniquely identifies a tuple, and for which no proper subset has this property.

– Inclusion dependencies (referential integrity constraints): a field in one relation may refer to a tuple in another relation by including its key. The referenced tuple must exist in the other relation for the database instance to be valid.

• Typically, a relation may have several candidate keys, one of which is chosen as the primary key.

Expressing constraints

• In SQL-92, these constraints are defined as follows:

CREATE TABLE Climbers
  (CId INTEGER,
   CName CHAR(20),
   Skill CHAR(4),
   Age INTEGER,
   PRIMARY KEY (CId),
   UNIQUE (CName, Age))

CREATE TABLE Climbs
  (CId INTEGER,
   RId INTEGER,
   Date DATE,
   Duration INTEGER,
   PRIMARY KEY (CId, RId),
   FOREIGN KEY (CId) REFERENCES Climbers,
   FOREIGN KEY (RId) REFERENCES Routes)

Example

The instances below satisfy these constraints.

• Insert (123, Jeremy, MED, 16) into Climbers?
• Insert (456, 2, 09/13/98, 3) into Climbs?
• Delete (313, Bridget, EXP, 33) from Climbers?
• Modify 123 to 456 in Climbers?

Climbers:
CId  CName    Skill  Age
123  Edmund   EXP    80
214  Arnold   BEG    25
313  Bridget  EXP    33
212  James    MED    27

Climbs:
CId  RId  Date      Duration
123  1    10/10/88  5
123  3    11/08/87  1
313  1    12/08/89  5
214  2    08/07/92  2
313  1    06/07/94  3
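One way to explore the four questions is to load the tables into an in-memory SQLite database and see which updates the constraints accept. This is a sketch under several assumptions: SQLite stands in for a SQL-92 system, Routes is cut down to two columns, dates are stored as plain strings, and `attempt` is a hypothetical helper. Only the first four Climbs rows are loaded, because the fifth row repeats the pair (313, 1) and would itself be rejected under PRIMARY KEY (CId, RId); the sketch tries it as a separate update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only if asked

conn.executescript("""
CREATE TABLE Routes (RId INTEGER, RName CHAR(20), PRIMARY KEY (RId));
CREATE TABLE Climbers (
    CId INTEGER, CName CHAR(20), Skill CHAR(4), Age INTEGER,
    PRIMARY KEY (CId), UNIQUE (CName, Age));
CREATE TABLE Climbs (
    CId INTEGER, RId INTEGER, Date DATE, Duration INTEGER,
    PRIMARY KEY (CId, RId),
    FOREIGN KEY (CId) REFERENCES Climbers,
    FOREIGN KEY (RId) REFERENCES Routes);
""")
conn.executemany("INSERT INTO Routes VALUES (?, ?)",
                 [(1, "Last Tango"), (2, "Garden Path"), (3, "The Sluice"), (4, "Picnic")])
conn.executemany("INSERT INTO Climbers VALUES (?, ?, ?, ?)",
                 [(123, "Edmund", "EXP", 80), (214, "Arnold", "BEG", 25),
                  (313, "Bridget", "EXP", 33), (212, "James", "MED", 27)])
conn.executemany("INSERT INTO Climbs VALUES (?, ?, ?, ?)",
                 [(123, 1, "10/10/88", 5), (123, 3, "11/08/87", 1),
                  (313, 1, "12/08/89", 5), (214, 2, "08/07/92", 2)])
conn.commit()


def attempt(sql, params=()):
    """Run one statement; return True if the constraints allowed it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "insert climber 123 again": attempt(
        "INSERT INTO Climbers VALUES (?, ?, ?, ?)", (123, "Jeremy", "MED", 16)),
    "insert climb by unknown climber 456": attempt(
        "INSERT INTO Climbs VALUES (?, ?, ?, ?)", (456, 2, "09/13/98", 3)),
    "delete climber 313": attempt("DELETE FROM Climbers WHERE CId = 313"),
    "modify 123 to 456": attempt("UPDATE Climbers SET CId = 456 WHERE CId = 123"),
    "second climb of route 1 by 313": attempt(
        "INSERT INTO Climbs VALUES (?, ?, ?, ?)", (313, 1, "06/07/94", 3)),
    # Not on the slide: a climber with a fresh id, added here for contrast.
    "insert new climber 999": attempt(
        "INSERT INTO Climbers VALUES (?, ?, ?, ?)", (999, "Jeremy", "MED", 16)),
}
```

In this setup each update from the slide is refused: the first by the primary key, the second by a foreign key, and the last two because Climbs still refers to the affected climber. Other systems, or foreign keys declared with cascading actions, may behave differently.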

Relational Algebra

• Relational algebra is a set of operations (functions) each of which takes a relation (or relations) as input and produces a relation as output. There are five basic operations:

– Projection
– Selection
– Union
– Difference
– Product

• Using these we can build up sophisticated database queries.

Projection

• Given a list of column names A and a relation R, $\pi_A(R)$ extracts the columns in A from the relation. Example:

Routes:
RId  RName        Grade  Rating  Height
1    Last Tango   2      12      100
2    Garden Path  1      2       60
3    The Sluice   1      8       60
4    Picnic       3      3       400

$\pi_{RId,Height}(Routes)$:
RId  Height
1    100
2    60
3    60
4    400
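A small sketch of projection over a set-of-tuples relation follows. The function name `project` and the way attribute names are passed alongside the rows are assumptions made for illustration.

```python
ROUTES_ATTRS = ("RId", "RName", "Grade", "Rating", "Height")
routes = {
    (1, "Last Tango", 2, 12, 100),
    (2, "Garden Path", 1, 2, 60),
    (3, "The Sluice", 1, 8, 60),
    (4, "Picnic", 3, 3, 400),
}


def project(attrs, rows, wanted):
    """Projection: keep only the columns named in wanted, returning a set."""
    positions = [attrs.index(name) for name in wanted]
    return {tuple(row[i] for i in positions) for row in rows}


rid_height = project(ROUTES_ATTRS, routes, ("RId", "Height"))
```

Because the result is built as a set, any duplicate rows produced by dropping columns disappear automatically.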

Projection, cont.

• Suppose the result of a projection has a repeated value, how do we treat it? For $\pi_{Height}(Routes)$ there are two candidate answers:

First answer:
Height
100
60
60
400

Second answer:
Height
100
60
400

• In “pure” relational algebra the answer is always a set (the second answer). However SQL and some other languages return, by default, a multiset (the first answer).
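The difference between the two answers can be seen with Python's built-in collections. Using `Counter` to stand in for a multiset is an assumption of this sketch.

```python
from collections import Counter

# The Height column of Routes, read row by row.
heights_in_routes = [100, 60, 60, 400]

# Multiset (bag) answer: duplicates are kept and counted.
bag = Counter(heights_in_routes)

# Set answer: duplicates are eliminated, as in pure relational algebra.
as_set = set(heights_in_routes)

bag_size = sum(bag.values())
set_size = len(as_set)
```

In SQL terms the bag corresponds to a plain SELECT and the set to SELECT DISTINCT.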

Selection

• Selection $\sigma_C(R)$ takes a relation R and extracts those rows from it that satisfy the condition C. For example,

$\sigma_{Height=60}(Routes)$:
RId  RName        Grade  Rating  Height
2    Garden Path  1      2       60
3    The Sluice   1      8       60
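Selection can be sketched as a filter over the rows. The function name `select` and the choice to pass the condition as a Python function over a row dictionary are assumptions; the condition Height = 60 is used because it yields exactly the two rows shown.

```python
ROUTES_ATTRS = ("RId", "RName", "Grade", "Rating", "Height")
routes = {
    (1, "Last Tango", 2, 12, 100),
    (2, "Garden Path", 1, 2, 60),
    (3, "The Sluice", 1, 8, 60),
    (4, "Picnic", 3, 3, 400),
}


def select(attrs, rows, condition):
    """Selection: keep the rows for which condition(row as a dict) is true."""
    return {row for row in rows if condition(dict(zip(attrs, row)))}


height_60 = select(ROUTES_ATTRS, routes, lambda r: r["Height"] == 60)
```

The schema of the result is the same as that of the input; only the set of rows shrinks.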

What can go in a condition?

• Conditions are built up from boolean-valued operations on the field names. E.g. Height>100, RName = "Picnic", Rating=Height

• Predicates constructed from these using logical or, and, not: $\vee$, $\wedge$, $\neg$

• It turns out that we don't lose any expressive power if we don't have complex predicates in the language, but they are convenient and useful in practice.

Set operations -- Union

• If two relations have the same structure (Database terminology: are union-compatible. Programming language terminology: have the same type) we can perform set operations.

Climbers:
CId  CName    Skill  Age
123  Edmund   EXP    80
214  Arnold   BEG    25
313  Bridget  EXP    33
212  James    MED    27

Hikers:
CId  CName   Skill  Age
214  Arnold  BEG    25
898  Jane    MED    39

$Climbers \cup Hikers$:
CId  CName    Skill  Age
123  Edmund   EXP    80
214  Arnold   BEG    25
313  Bridget  EXP    33
212  James    MED    27
898  Jane     MED    39
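The union-compatibility requirement can be sketched as a check on the attribute lists before taking a set union. Comparing attribute names for equality is a simplifying assumption; a fuller check would compare domains as well.

```python
ATTRS = ("CId", "CName", "Skill", "Age")
climbers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}
hikers = {
    (214, "Arnold", "BEG", 25),
    (898, "Jane", "MED", 39),
}


def union(attrs_r, r, attrs_s, s):
    """Union of two relations, refused unless their schemas match."""
    if attrs_r != attrs_s:
        raise ValueError("relations are not union-compatible")
    return r | s


climbers_or_hikers = union(ATTRS, climbers, ATTRS, hikers)
```

Arnold appears in both inputs but only once in the result, since the result is a set.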

Set operations -- difference

• An example:

Climbers:
CId  CName    Skill  Age
123  Edmund   EXP    80
214  Arnold   BEG    25
313  Bridget  EXP    33
212  James    MED    27

Beginners:
CId  CName   Skill  Age
214  Arnold  BEG    25
987  Zoey    BEG    18

$Climbers - Beginners$:
CId  CName    Skill  Age
123  Edmund   EXP    80
313  Bridget  EXP    33
212  James    MED    27

Set operations -- other

• It turns out we can implement the other set operations using those we already have. For example, what about set intersection?

• Again, we have to be careful. Although it is mathematically nice to have fewer operators, expressing intersection through set difference may be less efficient than implementing intersection directly.
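One possible answer to the intersection question uses difference twice, via the identity $R \cap S = R - (R - S)$. The sketch below checks it on the Climbers, Hikers and Beginners instances from the earlier slides; the helper name `intersect_via_difference` is an assumption.

```python
climbers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}
hikers = {(214, "Arnold", "BEG", 25), (898, "Jane", "MED", 39)}
beginners = {(214, "Arnold", "BEG", 25), (987, "Zoey", "BEG", 18)}

# Set difference: tuples in Climbers that are not in Beginners.
non_beginners = climbers - beginners


def intersect_via_difference(r, s):
    """Intersection written with difference only: R - (R - S)."""
    return r - (r - s)


both = intersect_via_difference(climbers, hikers)
same_as_builtin = both == (climbers & hikers)
```

The derived form computes two differences where a direct intersection needs one pass, which is the efficiency concern raised above.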

Optimizations -- a hint of things to come

• We mentioned earlier that compound predicates in selections were not “essential” to relational algebra. This is because we can translate selections with compound predicates into set operations. Example:

$\sigma_{C \wedge D}(R) = \sigma_C(R) \cap \sigma_D(R)$

$\sigma_{C \vee D}(R) = \sigma_C(R) \cup \sigma_D(R)$

• However, which do you think is more efficient?

• Also, how would you translate $\sigma_{\neg C}(R)$?
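The translation of compound predicates into set operations can be checked on the Routes instance. The two conditions used here (Rating > 5 and Height <= 100) are arbitrary choices for illustration, and the last line shows one possible answer for negation, namely taking the difference from R.

```python
ROUTES_ATTRS = ("RId", "RName", "Grade", "Rating", "Height")
routes = {
    (1, "Last Tango", 2, 12, 100),
    (2, "Garden Path", 1, 2, 60),
    (3, "The Sluice", 1, 8, 60),
    (4, "Picnic", 3, 3, 400),
}


def select(attrs, rows, condition):
    """Selection: keep the rows for which condition(row as a dict) is true."""
    return {row for row in rows if condition(dict(zip(attrs, row)))}


def c(r):  # illustrative condition C
    return r["Rating"] > 5


def d(r):  # illustrative condition D
    return r["Height"] <= 100


sel_c = select(ROUTES_ATTRS, routes, c)
sel_d = select(ROUTES_ATTRS, routes, d)

# Conjunction as intersection, disjunction as union, negation as difference.
and_matches = select(ROUTES_ATTRS, routes, lambda r: c(r) and d(r)) == (sel_c & sel_d)
or_matches = select(ROUTES_ATTRS, routes, lambda r: c(r) or d(r)) == (sel_c | sel_d)
not_matches = select(ROUTES_ATTRS, routes, lambda r: not c(r)) == (routes - sel_c)
```

The compound selection scans R once, while the set-operation form scans it once per simple condition and then combines the results.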

Database Queries

• Queries are formed by building up expressions with the operations of the relational algebra. Even with the operations we have defined so far we can do something useful. For example, select-project expressions are very common:

$\pi_{CName,Age}(\sigma_{Age>30}(Climbers))$

– What does this mean in English?

– Also, could we interchange the order of the $\sigma$ and $\pi$? Can we always do this?

• As another example, how would you “delete” the climber named James from the database?
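A select-project query can be sketched by composing two small functions. The condition Age > 30 and the projected columns CName and Age are taken as assumptions for illustration; `select` and `project` are hypothetical helpers. The last part probes the interchange question for one case where the projection drops the attribute the condition needs.

```python
ATTRS = ("CId", "CName", "Skill", "Age")
climbers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}


def select(attrs, rows, condition):
    """Selection: keep the rows for which condition(row as a dict) is true."""
    return {row for row in rows if condition(dict(zip(attrs, row)))}


def project(attrs, rows, wanted):
    """Projection: keep only the columns named in wanted."""
    positions = [attrs.index(name) for name in wanted]
    return {tuple(row[i] for i in positions) for row in rows}


def over_30(r):
    return r["Age"] > 30


# Select first, then project.
select_then_project = project(ATTRS, select(ATTRS, climbers, over_30), ("CName", "Age"))

# Project first, then select: fine here, because Age survives the projection.
project_then_select = select(("CName", "Age"),
                             project(ATTRS, climbers, ("CName", "Age")), over_30)

# If the projection drops Age, the selection can no longer be evaluated.
try:
    select(("CName",), project(ATTRS, climbers, ("CName",)), over_30)
    swap_always_possible = True
except KeyError:
    swap_always_possible = False
```

In this sketch the two orders agree whenever the projection keeps every attribute mentioned in the condition.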

Joins

• Join is a generic term for a variety of operations that connect two relations that are not union compatible. The basic operation is the product, R × S, which concatenates every tuple in R with every tuple in S.

R:
A   B
a1  b1
a2  b2

S:
C   D
c1  d1
c2  d2
c3  d3

R × S:
A   B   C   D
a1  b1  c1  d1
a1  b1  c2  d2
a1  b1  c3  d3
a2  b2  c1  d1
a2  b2  c2  d2
a2  b2  c3  d3
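The product is easy to sketch with `itertools.product`, which pairs every tuple of one relation with every tuple of the other. The relation names R and S and the variable names below are illustrative.

```python
from itertools import product as cartesian

r = {("a1", "b1"), ("a2", "b2")}
s = {("c1", "d1"), ("c2", "d2"), ("c3", "d3")}

# Concatenate every tuple of R with every tuple of S.
r_times_s = {x + y for x, y in cartesian(r, s)}

row_count = len(r_times_s)  # |R| * |S|
```

The result has arity 4 and one row per pair of input rows.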

Products, cont.

• What happens when we form a product of two relations with columns with the same name? Details vary, but a common answer is to suffix the attribute names with 1 and 2.

• Climbs × Climbers will have a schema: (CId:1, RId, Date, Duration, CId:2, CName, Skill, Age)

Climbers:
CId  CName    Skill  Age
123  Edmund   EXP    80
214  Arnold   BEG    25
313  Bridget  EXP    33
212  James    MED    27

Climbs:
CId  RId  Date      Duration
123  1    10/10/88  5
123  3    11/08/87  1
313  1    12/08/89  5
214  2    08/07/92  2
313  1    06/07/94  3

Products, cont.

• Products are hardly ever used alone; they are typically used in conjunction with a selection.

$\sigma_{CId:1 = CId:2}(Climbs \times Climbers)$:
CId:1  RId  Date      Duration  CId:2  CName    Skill  Age
123    1    10/10/88  5         123    Edmund   EXP    80
123    3    11/08/87  1         123    Edmund   EXP    80
313    1    12/08/89  5         313    Bridget  EXP    33
214    2    08/07/92  2         214    Arnold   BEG    25
313    1    06/07/94  3         313    Bridget  EXP    33

• Note that this relation has useful information. We can tell, for example, the names of climbers who have climbed a certain route.

Theta Joins

• The combination of a selection and a product is so common that we give it a special symbol (and name):

$R \bowtie_C S = \sigma_C(R \times S)$

• Example:

$Climbs \bowtie_{CId:1 = CId:2} Climbers$

• The condition in a theta join is almost always an equality or conjunction of equalities. (Note: the name “theta” refers to the condition, C; this is also called the “conditional” join.)
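A theta join can be sketched literally as a product followed by a selection. The helper names `product` and `theta_join` and the ":1"/":2" suffixing of clashing attribute names are assumptions that follow the convention described on the Products slides.

```python
CLIMBS_ATTRS = ("CId", "RId", "Date", "Duration")
climbs = {
    (123, 1, "10/10/88", 5),
    (123, 3, "11/08/87", 1),
    (313, 1, "12/08/89", 5),
    (214, 2, "08/07/92", 2),
    (313, 1, "06/07/94", 3),
}
CLIMBERS_ATTRS = ("CId", "CName", "Skill", "Age")
climbers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}


def product(attrs_r, r, attrs_s, s):
    """R x S; attribute names found on both sides get :1 / :2 suffixes."""
    shared = set(attrs_r) & set(attrs_s)
    left = tuple(a + ":1" if a in shared else a for a in attrs_r)
    right = tuple(a + ":2" if a in shared else a for a in attrs_s)
    return left + right, {x + y for x in r for y in s}


def theta_join(attrs_r, r, attrs_s, s, condition):
    """Theta join: select from the product the rows satisfying the condition."""
    attrs, rows = product(attrs_r, r, attrs_s, s)
    return attrs, {row for row in rows if condition(dict(zip(attrs, row)))}


join_attrs, joined = theta_join(
    CLIMBS_ATTRS, climbs, CLIMBERS_ATTRS, climbers,
    lambda row: row["CId:1"] == row["CId:2"],
)
```

This version materializes all 20 product rows before filtering; a real system would normally avoid that, but the result is the same five rows.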

Renaming

• Our example yields a relation with fields CId:1 and CId:2 with the same information. Almost certainly we want to get rid of one of them, and this can be done using projection.

• We probably also want to rename the remaining field CId:1 to CId. For this we need a renaming operation $\rho_{a \to b}(R)$, which renames the a attribute of R to b. In practical query languages, renaming is carried out by a different means, and we shall usually ignore this unimportant operation.

Natural Join

• The most common join to do is an equality join of two relations on commonly named fields, and to leave one copy of those fields in the resulting relation. This is what we just did with Climbs and Climbers. This is called natural join and its symbol is $\bowtie$ (no subscript).

$Climbs \bowtie Climbers$:
CId  RId  Date      Duration  CName    Skill  Age
123  1    10/10/88  5         Edmund   EXP    80
123  3    11/08/87  1         Edmund   EXP    80
313  1    12/08/89  5         Bridget  EXP    33
214  2    08/07/92  2         Arnold   BEG    25
313  1    06/07/94  3         Bridget  EXP    33
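The natural join can be sketched as a nested loop that matches rows on every commonly named attribute and keeps a single copy of those attributes. The function name `natural_join` is an assumption, and the nested loop is chosen for clarity rather than speed.

```python
CLIMBS_ATTRS = ("CId", "RId", "Date", "Duration")
climbs = {
    (123, 1, "10/10/88", 5),
    (123, 3, "11/08/87", 1),
    (313, 1, "12/08/89", 5),
    (214, 2, "08/07/92", 2),
    (313, 1, "06/07/94", 3),
}
CLIMBERS_ATTRS = ("CId", "CName", "Skill", "Age")
climbers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}


def natural_join(attrs_r, r, attrs_s, s):
    """Equality join on commonly named attributes, keeping one copy of each."""
    shared = [a for a in attrs_r if a in attrs_s]
    keep_s = [i for i, a in enumerate(attrs_s) if a not in shared]
    attrs = attrs_r + tuple(attrs_s[i] for i in keep_s)
    rows = set()
    for x in r:
        for y in s:
            if all(x[attrs_r.index(a)] == y[attrs_s.index(a)] for a in shared):
                rows.add(x + tuple(y[i] for i in keep_s))
    return attrs, rows


nj_attrs, nj_rows = natural_join(CLIMBS_ATTRS, climbs, CLIMBERS_ATTRS, climbers)

# Using the joined relation: names of climbers who have climbed route 1.
names_route_1 = {row[nj_attrs.index("CName")]
                 for row in nj_rows if row[nj_attrs.index("RId")] == 1}
```

James does not appear in the result at all, because no Climbs tuple carries his CId.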

Examples

• This completes the basic operations of the relational algebra. We shall soon find out in what sense this is an adequate set of operations. Try writing queries for these:

– The names of climbers older than 32.
– The names of climbers who have climbed route 1.
– The names of climbers who have climbed the route named Last Tango.
– The names of climbers with age less than 40 who have climbed a route with rating higher than 5.
– The names of climbers who have not climbed anything.

Division (not in the book)

• Division is a somewhat messy operation and can be expressed in terms of the operations we have already defined. It is used to express queries such as “The CId's of climbers who have climbed all routes”.

• Another way of phrasing this is to ask for “The CId's of climbers for which there does not exist a route that they haven't climbed.”

Division, cont.

Let's express this query with the operations we have already defined.

• First we can build a relation with all possible pairs of routes and climbers:

$\pi_{CId}(Climbers) \times \pi_{RId}(Routes)$

Let's call this relation Allpairs.

• Next, compute the set of all (CId,RId) pairs for which climber CId has not climbed route RId. Let's call this relation NotClimbed:

$Allpairs - \pi_{CId,RId}(Climbs)$

Division, cont.

• Next, $\pi_{CId}(NotClimbed)$ is the set of id's of climbers who have not climbed some route.

• Finally, the climbers who have climbed all routes are the ones who have not failed to climb some route:

$\pi_{CId}(Climbers) - \pi_{CId}(NotClimbed)$

$= \pi_{CId}(Climbers) - \pi_{CId}((\pi_{CId}(Climbers) \times \pi_{RId}(Routes)) - \pi_{CId,RId}(Climbs))$
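The construction can be followed step by step on the example instance. The function name `climbed_all` is an assumption, and the second call uses a hypothetical two-route database so that the answer is not empty (on the instance from the slides nobody has climbed all four routes).

```python
climber_ids = {123, 214, 313, 212}                 # projection of Climbers on CId
route_ids = {1, 2, 3, 4}                           # projection of Routes on RId
climbed = {(123, 1), (123, 3), (313, 1), (214, 2)}  # projection of Climbs on CId, RId


def climbed_all(climber_ids, route_ids, climbed):
    """Ids of climbers who have climbed every route, using product and difference."""
    all_pairs = {(c, r) for c in climber_ids for r in route_ids}  # Allpairs
    not_climbed = all_pairs - climbed                              # NotClimbed
    failed_some_route = {c for c, _ in not_climbed}                # its CId projection
    return climber_ids - failed_some_route


on_slide_instance = climbed_all(climber_ids, route_ids, climbed)

# Hypothetical variant: suppose the database held only routes 1 and 3.
with_two_routes = climbed_all(climber_ids, {1, 3}, climbed)
```

Each line of the function corresponds to one named relation in the derivation, which is what makes the long expression readable.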

Division: the operator

• Rather than write this long expression, it is easier to use the notation $R \div S$. The schema of R must be a superset of the schema of S, and the result has schema schema(R)-schema(S).

• We could write “Climbers who have climbed all routes” as

$\pi_{CId,RId}(Climbs) \div \pi_{RId}(Routes)$

• What about “Routes that have been climbed by all climbers”?
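A generic division operator can be sketched by grouping the tuples of R on the attributes that are not in S and keeping the groups that cover all of S. The function name `divide` and the grouping strategy are assumptions; the last two calls use hypothetical divisors so that the results are not empty, and the final one shows one possible reading of the closing question.

```python
def divide(attrs_r, r, attrs_s, s):
    """R divided by S: tuples over schema(R)-schema(S) paired in R with every tuple of S."""
    if not set(attrs_s) <= set(attrs_r):
        raise ValueError("schema(S) must be a subset of schema(R)")
    keep = [i for i, a in enumerate(attrs_r) if a not in attrs_s]
    s_positions = [attrs_r.index(a) for a in attrs_s]
    out_attrs = tuple(attrs_r[i] for i in keep)
    seen = {}  # candidate result tuple -> set of S-shaped tuples seen with it in R
    for row in r:
        key = tuple(row[i] for i in keep)
        seen.setdefault(key, set()).add(tuple(row[i] for i in s_positions))
    return out_attrs, {key for key, parts in seen.items() if s <= parts}


PAIR_ATTRS = ("CId", "RId")
climbed = {(123, 1), (123, 3), (313, 1), (214, 2)}  # projection of Climbs on CId, RId

# Climbers who have climbed all four routes of the example instance.
_, all_routes = divide(PAIR_ATTRS, climbed, ("RId",), {(1,), (2,), (3,), (4,)})

# Hypothetical divisor: only routes 1 and 3.
_, routes_1_and_3 = divide(PAIR_ATTRS, climbed, ("RId",), {(1,), (3,)})

# Dividing by a set of climbers instead gives routes climbed by all of them
# (hypothetical divisor: climbers 123 and 313).
out_attrs, by_both = divide(PAIR_ATTRS, climbed, ("CId",), {(123,), (313,)})
```

The schema check mirrors the requirement that schema(R) be a superset of schema(S), and the output schema is whatever remains.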