database systems i - uni-tuebingen.de

Database Systems IFoundations of Databases

Summer term 2010

Melanie [email protected]

Database Systems Group, University of Tübingen

1

Foundations of Databases | Summer term 2010 | Melanie Herschel | University of Tübingen

Chapter 6SQL

1. Introduction2. ER-Modeling3. Relational model(ing)4. Relational algebra5. SQL6. Programming7. Advanced topics

•After completing this chapter, you should be able to

‣ Write advanced SQL queries including, e.g., multiple tuple variables over different/the same relation,

‣ Use aggregation, grouping, UNION,

‣ Be comfortable with the various join variants,

‣ Evaluate the correctness and equivalence of SQL queries,

‣ Judge the portability of certain SQL constructs.

2

Foundations of Databases | Summer term 2010 Melanie Herschel | University of Tübingen

Chapter 6SQL

• SELECT-FROM-WHERE Blocks, Tuple Variables

• Subqueries, Non-Monotonic Constructs

• Aggregation

• Union, Conditional Expressions

• ORDER BY

• SQL-92 Joins, Outer Joins3


Basic SQL Query Syntax

4

•The FROM clause declares which table(s) are accessed in the query.

•The WHERE clause specifies a condition for rows (or row combinations) in these tables that are considered in this query (absence of C is equivalent to C ≡ TRUE).

•The SELECT clause specifies the attributes of the result tuples (* ≡ output all attributes occurring in R1,...,Rm).

Basic SQL query (extension follows)

SELECT A1, ..., AnFROM R1, ..., RmWHERE C


Example Database (again)

5

Sample Movie Database Tables

AID Name DOB1 Jolie 4.6.75

2 Pitt 18.12.63

3 Depp 9.6.1963

ActorMID Title Year Rating1 Babel 2006 7

2 Inglorious Bastards 2009 8

3 Wanted 2008 3

4 King Kong 1960 5

5 King Kong 2003 4

Movie

AID MID Name1 3 Fox

2 1 Richard Jones

2 2 Lt. Aldo Raine

Role


Tuple Variables

6

•The FROM clause can be understood as declaring variables that range over tuples of a relation:

•This may be executed as

•Tuple variable A represents a single row of table Actor (the loop assigns each row in succession).

SELECT A.Name, A.DOBFROM ACTOR AWHERE A.AID = 1

foreach A ∈ Actor do if A.AID = 1 then print A.Name, A.DOB fiod


Tuple Variables

7

•For each table in the FROM clause, a tuple variable is automatically created: if not given explicitly, the variable will have the name of the relation:

In other words, stating FROM Actor only is understood as FROM Actor Actor (tuple variable Actor ranges over rows of table Actor).

• If a tuple variable is explicitly declared, e.g., FROM Actor A, then the implicit tuple variable Actor is not declared, i.e., Actor.AID will yield an error.

•A reference to an attribute A of a tuple variable R may be simply written as A (instead of R.A) if R is the only variable which is bound to tuples having an attribute named A.

SELECT Actor.Name, Actor.DOBFROM ActorWHERE Actor.AID = 1


Attribute References

8

• In general, attributes are referenced in the form R.A

• If an attribute may be associated to a tuple variable in an unambiguous manner, the variable may be omitted, e.g.:

Title can only refer to M.Title, Name can only refer to R.Name, Rating can only refer to M.Rating.

SELECT Title, NameFROM Movie M, Role RWHERE M.MID = R.MIDAND Rating > 5


Attribute References

9

• Consider this query:!

•Although forced to be equal by the join condition, SQL requires the user to specify unambiguously which of the MID attributes (bound to M or R) is meant in the SELECT clause.

•The unambiguity rule thus is purely syntactic and does not depend on the query semantics.

SELECT M.MID, Title, NameFROM Movie M, Role RWHERE M.MID = R.MIDAND Rating > 5


Joins

10

• Consider a query with two tuple variables:!

•M will range over 3 tuples in Movie, R will range over 2 tuples in Actor. In principle, all 3 " 2 = 6 combinations will be considered in condition C:

SELECT A1, ..., AnFROM Movie M, Role RWHERE C

foreach M ∈ Movie do foreach R ∈ Role do if C then print A1,...,An; fi odod


Joins

11

•A good DBMS will use a better evaluation algorithm (depending on the condition C).

This is the task of the query optimizer. For example, if C contains the join condition M.MID = R.MID, the DBMS might loop over the tuples in Movie and find the corresponding Role tuple by using an index over Role.MID (many DBMS automatically create an index over the key attributes).

• In order to understand the semantics of a query, however, the simple nested foreach algorithm suffices.

The query optimizer may use any algorithm that produces the exact same output, although possibly in different tuple order.


Joins

12

•A join needs to be explicitly specified in the WHERE clause:

SELECT M.Title, R.NameFROM Movie M, Role RWHERE M.MID = R.MID -- Join conditionAND M.Rating > 5

Output of this query?

SELECT R.Name, R.MIDFROM Actor A, Role RWHERE A.Name = ‘Pitt’


Joins

13

• Guideline: it is almost always an error if there are two tuples variables which are not linked (directly or indirectly) via join conditions.

• In this query, all three tuple variables are connected

• The conditions correspond to the foreign key relationships between the tables.

• The omission of a join condition will usually lead to numerous duplicates in the query result.

Usage of DISTINCT to get rid of duplicate does not fix the error in such a case, of course!

SELECT M.Title, A.NameFROM Movie M, Role R, Actor AWHERE M.MID = R.MID -- Link Movie/RoleAND R.AID = A.AID -- Link Role/ActorAND A.Name = ‘Pitt’AND M.YEAR > 1990


Query Formulation

14

• To formulate this query:

• We observe that actors born before 1960 are of interest! requires a tuple variable, say A, over Actor and the identifying condition A.DOB < ‘1.1.1960’

• Movies produced after 2000 are of interest, too.! We create a tuple variable, say M, for Movie and the condition M.Year > 2000.

• As part of the output, we only require actor names. It is sufficient to list each name once, but several movies may feature the same actor.! From clause should be written as SELECT DISTINCT A.Name

Formulate the following query in SQL

“Which are the names of all actors starring in movies produced after 2000 and that were born before 1960?”


Query Formulation

15

• So far, we have:

• We note that M and A are still unconnected.

• The connection graph of the tables in a database schema (edges correspond to foreign key relationships) helps in understanding the connection requirements:

• We see that the M–A connection is indirect and needs to be established via a tuple variable R over Role. The final query thus is:

SELECT DISTINCT A.NameFROM Movie M, Actor AWHERE M.YEAR > 2000AND A.DOB < ‘1.1.1960’

Movie Role Actor

SELECT DISTINCT A.NameFROM Movie M, Actor A, Role RWHERE M.YEAR > 2000AND A.DOB < ‘1.1.1960‘ AND M.MID = R.MID AND R.AID = A.AID


Query Formulation

16

• It is not always that trivial. The connection graph may contain cycles, which makes the selection of the “right path” more difficult (and error-prone).

• Consider a movie database that also stores general information about awards as well as what movies and actors received awards:

Movie Role Actor

AwardMovieWinners ActorWinners


Unnecessary Joins

17

• Do not join more tables than needed! The query will run slowly if the optimizer overlooks the redundancy.

Rating for Movie 1

SELECT M.Title, M.RatingFROM Movie M, Role RWHERE M.MID = R.MIDAND R.MID = 1

Will the following query produce the same results?

SELECT Title, RatingFROM Movie MWHERE M.MID = 1


Unnecessary Joins

18

What will be the result of this query?

SELECT M.Title, M.RatingFROM Movie M, Role RWHERE M.MID = 1

Is there any difference between these two queries?

SELECT A.Name, A.DOBFROM Actor A

SELECT DISTINCT A.Name, A.DOBFROM Actor A, Role RWHERE A.AID = R.AID


Self Joins

19

• In some query scenarios, we might have to consider more than one tuple of the same relation in order to generate a result tuple.

Is there a sequel (same title, larger year) of movie ‘King Kong’?

SELECT M2.*FROM Movie M1, Movie M2WHERE M1.Title = M2.TitleAND M1.Year < M2.YearAND M1.Title = ‘King Kong’

Is this a correct query to determine which actors have same date of birth as the actor named ‘Pitt’?

SELECT A2.NameFROM Actor A1, Actor A2WHERE A1.DOB = A2.DOBAND A1.Name = ‘Pitt’


Duplicate Elimination

20

• A core difference between SQL and relational algebra is that duplicates have to be explicitly eliminated in SQL.

What movie titles are stored in the DB?

TitleBabel

Inglorious Bastards

Wanted

King Kong

King Kong

SELECT TitleFROM Movie



21

• If a query might yield unwanted duplicate tuples, the DISTINCT modifier may be applied to the SELECT clause to request explicit duplicate row elimination:

• To emphasize that there will be duplicate rows (and that these are wanted in the result), SQL provides the ALL modifier (SELECT ALL is the default).

What movie titles are stored in the DB?

TitleBabel

Inglorious Bastards

Wanted

King Kong

SELECT DISTINCT TitleFROM Movie



22

• Sufficient condition for superfluous DISTINCT:

(1) Let K be the set of attributes selected for output by the SELECT clause.

(2) Add to K attributes A such that A = c!(constant c) appears in the WHERE clause.

Here we assume that the WHERE clause specifies a conjunctive condition.

(3) Add to K attributes A such that A = B (B ∈ K) appears in the WHERE clause. If K

contains a key of a tuple variable, add all attributes of that variable.

Repeat (3) until K stable.

(4) If K contains a key of every tuple variable listed under FROM, then DISTINCT is

superfluous.



23

• Let us assume that (Title, Year) is declared as alternative key for Movie.

(1) Initialize K = {M.Title, M.Year, R.AID, R.Name}.

(2) K = K ∪ {M.Year} because of the conjunct M.Year = 2000.

(3) K = K ∪ {M.MID, M.Rating} because K contains a key of Movie (M.Title,

M.Year).

(3) K = K ∪ {R.MID, R.Name} because of the conjunct M.MID = R.MID and key

of Role is contained in K

(4) K contains a key for Movie and Role so DISTINCT is superfluous.

Is DISTINCT superfluous?

SELECT DISTINCT M.Title, M.Year, R.AID, R.NameFROM Movie M, Role RWHERE M.MID = R.MIDAND M.Year = 2000


Summary: Query Formulation Traps

24

• Missing join conditions (very common).

• Unnecessary joins (may slow query down significantly).

• Self joins: incorrect treatment of multiple tuple variables which range over the same relation (missing (in)equality conditions).

• Unexpected duplicates, often an indicator for faulty queries (adding DISTINCT is no cure here).

• Unnecessary DISTINCT.Although today’s query optimizer are probably more “clever” than the average SQL user in proving the absence of duplicates.


Chapter 6SQL



• Aggregation


• ORDER BY



Non-Monotonic Behavior

26

• SQL queries using only the constructs introduced above compute monotonic functions on the database state: if further rows get inserted, these queries yield a superset of rows.

• However, not all queries behave monotonically in this way.

Example: “Find all movies that are not associated with any roles.”In the current DB state, both King Kong movies would be a correct answer. INSERT INTO Role VALUES (2, 5, ‘King Kong’) would invalidate the 2003 version of the movie.

• Obviously, such queries cannot be formulated with the SQL constructs introduced so far.


Non-Monotonic Behavior

27

• Note: in natural language, queries which contain formulations like “there is no”, “does not exists”, etc., indicate non-monotonic behavior (existential quantification).

• Furthermore, “for all”, “the minimum/maximum” also indicate non-monotonic behavior: in this case, a violation of a universally quantified condition must not exist.

• In an equivalent SQL formulation of such queries, this ultimately leads to a test whether a certain query yields a (non-)empty result.


NOT IN

28

• With IN (∈) and NOT IN (∉) it is possible to check whether an attribute value appears in a set of

values computed by another SQL subquery.

• At least conceptually, the subquery is evaluated before the evaluation of the main query starts:

• Then, for every Movie tuple, a matching MID is searched for in the subquery result. If there is none (NOT IN), the tuple is output.

Movies without any roles

SELECT Title, YearFROM Movie WHERE MID NOT IN (SELECT MID FROM Role)

Title YearKing Kong 1960

King Kong 2003

MID3

1

2

subquery resultMID Title Year Rating1 Babel 2006 7

2 Inglorious Bastards 2009 8

3 Wanted 2008 3

4 King Kong 1960 5

5 King Kong 2003 4

Movie


NOT IN

29

• Since the (non-)existence of particular tuples does not depend on multiplicity, we may equivalently use DISTINCT in the subquery:

• The effect on the performance depends on the DBMS and the data (sizes).

A reasonable optimizer will know about the NOT IN semantics and will decide on duplicate elimination/preservation itself, esp. because IN (NOT IN) may efficiently be implemented via semi-join and anti-join if duplicates are eliminated.

SELECT Title, YearFROM Movie WHERE MID NOT IN (SELECT DISTINCT MID FROM Role)


NOT IN

30

Names of actors that starred in at least one movie.

SELECT NameFROM ActorWHERE AID IN (SELECT AID FROM Role)

Is there a difference to this query (with or without DISTINCT?)

SELECT DISTINCT NAMEFROM Actor A, Role RWHERE A.AID = R.AID


NOT IN

31

• In SQL-86, the subquery is required to deliver a single output column.

This ensures that the subquery is a set (or multiset) and not an arbitrary relation.

• In SQL-92, comparisons were extended to the tuple level.It is thus valid to write, e.g.:

SELECT ...FROM ...WHERE (A, B) NOT IN (SELECT C, D FROM ...)


NOT EXISTS

32

• The construct NOT EXISTS enables the main (or outer) query to check whether the subquery result is empty.

• In the subquery, tuple variables declared in the FROM clause of the outer query may be referenced.

You may also do so for IN subqueries but this yields unnecessarily complicated query formulations (bad style).

• In this case, the outer query and subquery are correlated. In principle, the subquery has to be evaluated for every assignment of values to the outer tuple variables. (The subquery is “parameterized”.)


NOT EXISTS

33

• Tuple variable A loops over the three rows in Actor. Conceptually, the subquery is evaluated three times (with A.AID bound to the current AID value).

• Again: the DBMS is free to choose a more efficient equivalent evaluation strategy (query unnesting).

Actors that did not star in any movie

SELECT NameFROM Actor AWHERE NOT EXISTS (SELECT * FROM Role R WHERE R.AID = A.AID)


NOT EXISTS

34

• First, A is bound to the Actor tuple

• In the subquery, A.AID is “replaced by” 1 and the following query is executed:

• Since the result of the subquery is non-empty, the NOT EXISTS in the outer query is not satisfied for this S.

AID Name DOB1 Jolie 4.6.75

SELECT *FROM Role RWHERE R.AID = 1

AID MID Name1 3 Fox


NOT EXISTS

35

• “Finally”, A is bound to the Actor tuple

• In the subquery, A.AID is “replaced by” 3 and the following query is executed:

• Since the result of the subquery is empty, the NOT EXISTS in the outer query is satisfied and Depp is output.

AID Name DOB3 Depp 9.6.1963

SELECT *FROM Role RWHERE R.AID = 3

AID MID Name(no rows selected)(no rows selected)(no rows selected)


NOT EXISTS

36

• While in the subquery tuple variables from the outer query may be referenced, the converse is illegal:

• Compare this to variable scoping (global/local variables) in block-structured programming languages (Java, C). Subquery tuple variables declarations are “local.”

Wrong!

SELECT A.Name, R.NameFROM Actor AWHERE NOT EXISTS (SELECT * FROM Role R WHERE R.AID = A.AID)


NOT EXISTS

37

• Non-correlated subqueries with NOT EXISTS are almost always an indication of an error:

• If there is at least one tuple in Role, the overall result will be empty.

• Non-correlated subqueries evaluate to a relation constant and may make perfect sense (e.g., when used with (NOT) IN).

Wrong!

SELECT A.NameFROM Actor AWHERE NOT EXISTS (SELECT * FROM Role R)


NOT EXISTS

38

• Note: it is legal SQL syntax to specify an arbitrarily complex SELECT clause in the subquery, however, this does not affect the existential semantics of NOT EXISTS.

SELECT * ... documents this quite nicely. Some SQL developers prefer SELECT 42 ... or SELECT null ... or similar SQL code.

• Again, the query optimizer will know the NOT EXISTS semantics such that the exact choice of the SELECT clause is of no importance.

• It is legal SQL syntax to use EXISTS without negation:

Who has starred in at least one movie?

SELECT NameFROM Actor AWHERE EXISTS (SELECT * FROM Role R WHERE A.AID = R.AID)

Can we reformulate this query without using EXISTS?


“For All”

39

• In natural language: “A movie M1 starring Actor 1 is selected if there is no movie M2 for this actor with a higher rating than M1.”

Which movie starring Actor 2 obtained the highest rating?

SELECT TitleFROM Role R1, Movie M1WHERE R1.MID = M1.MIDAND R1.AID = 2AND NOT EXISTS (SELECT * FROM Movie M2 WHERE M2.Rating > M1.Rating)


“For All”

40

• Mathematical logic knows two quantifiers:

• ∃ X(φ)! existential quantifier

There is an X that satisfies formula φ.

• ∀ X(φ)! universal quantifier

For all X, formula φ is satisfied (true).

• In tuple relational calculus (TRC) the maximum rating for movies starring Actor 1 is

{M1.Title | M1 ∈ Movie ∧ ∀ M2 ( M2 ∈ Movie ⇒ M2.Rating " M1.Rating)}


“For All”

41

• For database queries, the pattern ∀ X (φ1 ⇒ φ2 ) is very typical.

Such a bounded quantifier (X is required to fulfill φ1) is natural because TRC requires that all queries are safe: values outside the DB state do not influence the truth values. This is important to ensure that the truth value of a formula for some DB state may be determined with finite effort (it suffices to take into account the values that actually occur in database relations).

Usually, formula φ1 binds X to (some subset of) the tuples of a database relation. For these tuples φ2 then checks further query conditions.


“For All”

42

• SQL does not offer a universal quantifier (but only the existential quantifier EXISTS).

However, see >= ALL further below.

• Of course, this is no problem because

∀ X(φ) ⇔ ¬∃ X(¬φ)

• The above query pattern thus becomes (SQL lacks ⇒)

¬∃ X (φ1 ∧ ¬φ2)


“For All”

43

• The last example above is thus logically equivalent to

{M1.Title | M1 ∈ Movie ∧ ¬∃ M2 ( M2 ∈ Movie ⇒ M2.Rating > M1.Rating)}

Which movie starring Actor 2 obtained the highest rating?

SELECT TitleFROM Role R1, Movie M1WHERE R1.MID = M1.MIDAND R1.AID = 2AND NOT EXISTS (SELECT * FROM Movie M2 WHERE M2.Rating > M1.Rating)


Common Errors

44

Does this query compute the movie with the highest rating for Actor 2?

SELECT DISTINCT NameFROM Role R, Movie M1, Movie M2WHERE R.AID = 2AND M1.Rating >= M2.RatingAND R.MID = M1.MID

If not, what does the query compute?


Common Errors

45

• Subqueries bring up concepts of variable scoping (just like in programming languages) and related pitfalls.

Return those movies that did not star actor 2.

SELECT TitleFROM Movie MWHERE NOT EXISTS (SELECT * FROM Role R, Movie M WHERE R.MID = M.MID AND R.AID = 2)


Common Errors

46

What is the error in this query?

“Find those actors (AID) who have neither starred in any movie in 2006 nor in any movie in 2010”.

SELECT AIDFROM RoleWHERE MID NOT IN (SELECT MID FROM Movie WHERE Year = 2006 OR Year = 2010)

1. Is this syntactically correct SQL?

2. What is the output of this query?

3. If the query is faulty, correct it.


Notes

47


ALL, ANY, SOME

48

• SQL allows to compare a single value with all values in a set (computed by a subquery).

• Such comparisons may be universally (ALL) or existentially (ANY) quantified.

• Note: usage of >= is important here.

Which actor(s) starred in 2006 movies that got highest rating?

SELECT AIDFROM Movie M, Role RWHERE Role.MID = M.MID AND M.Year = 2006AND M.Rating >= ALL (SELECT M2.Rating FROM Movie M2 WHERE M2.Year = 2006)


ALL, ANY, SOME

49

• The following is equivalent to the above query:

• Note that ANY (ALL) do not extend SQL’s expressiveness, since, e.g.

A < ANY (SELECT B FROM ... WHERE ...) ≡

EXISTS (SELECT 1 FROM ... WHERE ... AND A < B)

Using ANY

SELECT TitleFROM Movie M, Role RWHERE Role.MID = M.MID AND M.Year = 2006AND NOT M.Rating < ANY (SELECT M2.Rating FROM Movie M2 WHERE M2.Year = 2006)


ALL, ANY, SOME

50

• Syntactical remarks on comparisons with subquery results:

(1) ANY and SOME are synonyms.

(2) x IN S is equivalent to x = ANY S.

(3) The subquery must yield a single result column.

(4) If none of the keywords ALL, ANY, SOME are present, the subquery must yield at most one row.

With (3) , this ensures that the comparison is performed between atomic (non-set) values. An empty subquery result is equivalent to NULL.


Single Value Subqueries

51

• Comparisons with subquery results (note: no ANY, ALL) are possible if the subquery returns at most one row.

[Why is this guaranteed here?] Use (non-data dependent) constraints to ensure this condition, the DBMS will yield a runtime error if the subquery returns two or more rows.

What movie got the same rating as Movie 1?

SELECT TitleFROM MovieWHERE Rating = (SELECT Rating FROM Movie WHERE MID = 1)


Single Value Subqueries

52

• If the subquery has an empty result, the null value is returned.

Bad style!

SELECT TitleFROM MovieWHERE (SELECT Rating FROM Movie WHERE MID = 1) IS NULL


Subqueries under FROM

53

• Since the result of an SQL query is a table, it seems most natural to use a subquery result wherever a table might be specified, i.e., in the FROM clause.

• This principle of (query) language construction is known as orthogonality: language constructs may be combined in arbitrary fashion as long as the semantic/typing/. . . rules of the language are obeyed.

Relational algebra is an orthogonal query language.

• SQL versions up to SQL-86 were not orthogonal in this sense.



54

• One use of subqueries under FROM are nested aggregations (see further below).

• In the following example, the join of Movie and Role is computed in a subquery.

Join of Movie, Role, and Actor using a subquery in the FROM clause.

SELECT X.Title, A.NameFROM (SELECT M.Title, R.AID, R.Name FROM Movie M, Role R WHERE M.MID = R.MID ) X, Actor AWHERE A.AID = X.AID



55

• Inside the subquery, tuple variables introduced in the same FROM clause may not be referenced.

This is for historic reasons and could be considered an SQL “bug.”

Not allowed in SQL!

SELECT X.Title, A.NameFROM Actor A, (SELECT M.Title, R.AID, R.Name FROM Movie M, Role R WHERE M.MID = R.MID AND A.AID = R.AID ) X



56

• A view declaration registers a query (not a query result!) under a given identifier in the database:

• In queries, views may be used like stored tables:

• Views may be thought of as subquery macros.

View: MovieRoles

CREATE VIEW MovieRoles ASSELECT M.Title, R.AID, R.NameFROM Movie M, Role RWHERE M.MID = R.MID

SELECT X.Title, A.NameFROM MovieRoles X, Actor AWHERE A.AID = X.AID


Chapter 6SQL



• Aggregation


• ORDER BY



Aggregations

58

• Aggregation functions are functions from a set or multiset to a single value, e.g.,min{42,57,5,13,27} = 5 .

• Aggregation functions are used to summarize an entire set of values.

In the DB literature, aggregation functions are also known as group functions or column functions: the values of an entire column (or partitions of these values) form the input to such functions.

• Typical use: statistics, data analysis, report generation.


Aggregations

59

• SQL-92 defines the five main aggregation functions

COUNT, SUM, AVG, MAX, MIN

• Some DBMS define further aggregation functions:

CORRELATION, STDDEV, VARIANCE, FIRST, LAST, . . .

• Any commutative and associative binary operator with a neutral element can be extended (“lifted”) to work on set-valued arguments (e.g., SUM corresponds to +).

Commutative and associative, neutral element?

Why do we require these properties of the operators?


Aggregations

60

• Note: some aggregation functions are sensitive to duplicates (e.g., SUM, COUNT, AVG), some are insensitive (e.g., MIN, MAX).

• For the first type, SQL allows to explicitly request to ignore duplicates, e.g.:

...COUNT(DISTINCT A)...

• Simple aggregations feed the value set of an entire column into an aggregation function.

Below, we will discuss partitioning (or grouping) of columns.


Simple Aggregations

61

How many movies are in the current database state

SELECT COUNT(*) AS COUNTFROM Movie

COUNT5

Best and average movie rating since 2000

SELECT MAX(Rating) AS M, AVG(Rating) AS AFROM Movie WHERE YEAR > 2000

M A8 5.5

How many distinct titles are stored in the current state of the movie table

SELECT COUNT(DISTINCT TITLE) AS COUNTFROM Movie

COUNT4


Simple Aggregations

62

What percentage of the maximum movie rating (= 10) did movie 3 reach?

What is the minimum, maximum, and average movie rating of movies staring Pitt?


Aggregation Queries

63

• Basically, there three different types of queries in SQL:

1. “No aggregation”: Queries without aggregation functions and without GROUP BY and HAVING.

2. “Simple aggregation” Queries with aggregation functions in the SELECT clause but no GROUP BY (simple aggregations). Yield exactly one row.

3. Queries with GROUP BY.

• Each type has different syntax restrictions and is evaluated in a different way.

Again: we refer to the SQL semantics here. A DBMS is free to implement these semantics as it sees fit.


Evaluation

64

1. First, evaluate the FROM clause.

Conceptually, form all possible tuple combinations of the source tables (Cartesian product).

2. Evaluate the WHERE clause.

The Cartesian product produced in (1) is filtered (restricted) and only those tuple combinations satisfying the filter condition remain.

3. If no aggregation, GROUP BY or HAVING: evaluate the SELECT clause.

Evaluate projection list (terms, scalar expressions) for each tuple combination produced in (2) and print resulting tuples.


Evaluation

65

4. Else if simple aggregation: add column values received from phase (2) to sets/multisets that will be the input to the aggregation function(s).

• If no DISTINCT is used or if the aggregation function is idempotent (MIN, MAX), the aggregation results may be incrementally computed, no temporary sets need to be maintained (see next slide).

• Print the single row of aggregated value(s).


Evaluation

66

Simple aggregation, no DISTINCT

SELECT SUM(Rating) AS M, COUNT(*) AS CFROM Movie WHERE YEAR > 2000

Possible eval strategy (no intermediate storage required)

sum := 0;count := 0;

foreach M in Movies do if M.Year > 2000 then sum := sum + M.Rating; count := count + 1; fiod

print sum, count;


Restrictions

67

• Aggregations may not be nested (makes no sense).

• Aggregations may not be used in the WHERE clause:

• If an aggregation function is used and no GROUP BY is used (simple aggregation), no attributes may appear in the SELECT clause.

But see GROUP BY below

Wrong!

... WHERE COUNT(A) > 10 ...

Wrong!

SELECT MID, AVG(Rating)FROM MOVIE


Null Values and Aggregations

68

• Usually, null values are ignored (filtered out) before the aggregation operator is applied.

Exception: COUNT(*) counts null values (COUNT(*) counts rows, not attribute values).

• If the aggregation input set is empty, aggregation functions yield NULL.

Exception: COUNT returns 0.

This seems counter-intuitive, at least for SUM (where users might expect 0 in this case). However, this way a query can detect the difference between two types of empty input: (1) all column values NULL, or (2) no tuple qualified in WHERE clause.


GROUP BY

69

• SQL’s GROUP BY construct partitions the tuples of a table into disjoint groups. Aggregation functions may then be applied for each tuple group separately.

Average number of movies per actor (actually MIDs per AID)

SELECT AID, COUNT(*) AS CFROM RoleGROUP BY AID

All tuples agreeing in the AID values (i.e., belonging to the same actor) for a group for aggregation.

AID C1 1

2 2


GROUP BY

70

• The GROUP BY clause partitions the incoming tuples (after evaluation of the FROM and WHERE clauses) into groups based on value equality for the GROUP BY attributes:

• Aggregation is subsequently done for every group (yielding as many rows as groups).

AID-based groups formed by above example query

AID MID Name

1 3 Fox

2 1 Richard Jones

2 2 Lt. Aldo Raine


GROUP BY

71

• This construction can never produce empty groups (a COUNT(*) will never result in 0).

See outer joins for queries that deal with “empty groups.”

• Since the GROUP BY attributes have a unique value for every group, these attributes may be used in the SELECT clause.

A reference to any other attribute is illegal.

Wrong! Although MID is key and thus E.Year would be unique.

SELECT MID, Year, AVG(Rating)FROM MOVIEGROUP BY MID

Grouping by MID and YEAR is possible and will yield the desired result:

SELECT MID, Year, AVG(Rating)FROM MOVIEGROUP BY MID, YEAR


GROUP BY

72

Is there any semantical difference between these queries?

SELECT AID, AVG(Rating)FROM Movie, RoleWHERE AID = 2 AND Movie.MID = Role.MIDGROUP BY AID

SELECT AID, AVG(Rating)FROM Movie, RoleWHERE AID = 2 AND Movie.MID = Role.MIDGROUP BY AID, YEAR


Notes

73


GROUP BY

74

• The sequence of the GROUP BY attributes is not important.

• Grouping makes no sense if the GROUP BY attributes contain a key (if only one table is listed in the FROM clause): each group will contain a single row only.

• Duplicates should be eliminated with DISTINCT, although such elimination can also be realized via GROUP BY:

• This is an abuse of GROUP BY and should be avoided.

Grouping without aggregation = DISTINCT

SELECT MID, YearFROM MOVIEGROUP BY MID, YEAR


HAVING

75

• Remember: aggregations may not be used in the WHERE clause.

• With GROUP BY, however, it may make sense to filter out entire groups based on some aggregated group property.

For example, only groups of size greater than n tuples may be significant.

• This is possible with SQL’s HAVING clause.

The condition in the HAVING clause may (only) involve aggregation functions and the GROUP BY attributes.


HAVING

76

• The WHERE clause refers to single tuples, the HAVING condition applies to entire groups (in this case: all role tuples of an actor).

• If a condition refers to GROUP BY attributes only (but not aggregations), it may be placed under WHERE or HAVING.

Which actors stared in at least 2 movies?

SELECT AIDFROM RoleGROUP BY AIDHAVING COUNT(Name) >= 2

AID

2

COUNT vs. HAVING

SELECT AIDFROM RoleGROUP BY AIDHAVING AID = 1 AND COUNT(Name) >= 2

SELECT AIDFROM RoleWHERE AID = 1GROUP BY AIDHAVING COUNT(Name) >= 2


Aggregation Subqueries

77

• The aggregate in the subquery is guaranteed to yield exactly one row as required.

• Remember: our earlier solution to this problem was using ANY/ALL.

Which actor(s) starred in 2006 movies that got highest rating?

SELECT AIDFROM Movie M, Role RWHERE Role.MID = M.MID AND M.Year = 2006AND M.Rating = (SELECT MAX(Rating) FROM Movie )


Aggregation Subqueries

78

• In SQL-92, aggregation subqueries may be placed into the SELECT clause. This may replace GROUP BY.

The average rating for each actor

SELECT AID, (SELECT AVG(Rating) FROM Role R2, Movie M WHERE R2.AID = R.AID ) AS AvgRatingFROM Role R


Nested Aggregations

79

• Nested aggregations require a subquery in the FROM clause.

What is the minimum average rating of movies within a year?

SELECT MIN(Year_Rating)FROM (SELECT YEAR, AVG(Rating) AS Year_Rating FROM MOVIE GROUP BY YEAR) X


Maximizing Aggregations

80

• Alternatively, we could use a view to solve this problem (see next slide).

What actor has the best average rating for his/her movies?

SELECT A.AID, A.Name, AVG(Rating) AS AvgRatingFROM Actor A, Role R, Movie MWHERE A.AID = R.AID AND R.MID = M.MIDGROUP BY A.AID, A.NameHAVING AVG(Rating) >= ALL (SELECT AVG(Rating) FROM Role R, Movie M WHERE R.MID = M.MID GROUP BY A.AID)


Maximizing Aggregations

81

View: average movie rating for each actor (who stars in at least one movie)

CREATE VIEW ActorRatings ASSELECT AID, AVG(Rating) AS AvgRating FROM Role R, Movie MWHERE R.MID = M.MIDGROUP BY A.AID

Alternative formulation of query on previous slide

SELECT A.AID, A.Name, AR.AvgRatingFROM Actor A, ActorRatings AR, Movie MWHERE A.AID = AR.AID AND AR.AvgRating = (SELECT MAX(AvgRating) FROM ActorRatings)


Chapter 6SQL



• Aggregation


• ORDER BY



UNION

83

• In SQL, it is possible to combine the results of two queries by UNION.

• UNION is strictly needed since there is no other method to construct one result column that draws from different tables/columns.

This is necessary, for example, if specializations of a concept (“subclasses”) are stored in separate tables. For instance, there may be ActionMovie and AnimationMovie tables (both of which are specializations of the general concept Movie).

• UNION is also commonly used for case analysis (cf., the if · · · then · · · cascades in programming languages).


UNION

84

• The UNION operand subqueries must return tables with the same number of columns and compatible data types.

• Columns correspondence is by column position (1st, 2nd, . . . ). Column names need not be identical (IBM DB2, for example, creates artificial column names 1, 2, . . . , if necessary. Use column renaming via AS if column names matter.

• SQL distinguishes between

‣ UNION: like RA ∪ with duplicate elimination, and

‣ UNION ALL: concatenation (duplicates retained).

• Other SQL-92 set operations: EXCEPT (#), INTERSECT (∩). These do not add to the expressivity of SQL (redundant operators).


UNION

85

Total number of movies per actor (or 0 if actor did not star in any movie)

SELECT A.Name, COUNT(R.MID) AS MovieCountFROM Actor A, Role RWHERE A.AID = R.AIDUNION ALLSELECT A.Name, 0 AS MovieCountFROM Actor AWHERE A.AID NOT IN (SELECT AID FROM Role)

Assign movie classes (Favorites and Good movies) based on rating

SELECT Title, ‘Favorites’ AS MovieClassFROM MovieWHERE Rating >= 8UNION ALLSELECT Title ‘Good movies’ AS MovieClassFROM MovieWHERE Rating > 5 AND Rating < 8


Conditional Expressions

86

• Whereas UNION is the portable way to conduct a case analysis, sometimes a conditional expression suffices and is more efficient.

Here, we will use the SQL-92 (and, e.g., DB2) syntax. Conditional expression syntax varies between DBMSs. Oracle uses DECODE( · · · ), for example.

Assign movie classes (Favorites and Good movies) based on rating

SELECT CASE WHEN Rating >= 8 THEN ‘Favorites’ WHEN Rating > 5 AND Rating < 8 THEN ‘Good movies’ ELSE ‘Unknown Category’ END MovieClass, TitleFROM Movie

What is the difference in terms of the output to query on previous slide?


Conditional Expressions

87

• A typical application of a conditional expression is to replace a null value by another (non-null) value Y :

• In SQL-92, this may be abbreviated to

• Conditional expressions are regular terms, so they may be input for other functions, comparisons, or aggregate functions.

... CASE WHEN X IS NOT NULL THEN X ELSE Y END ...

... COALESCE(X, Y) ...

Return names and DOB of actors, and an ‘unknown’ value if DOB unspecified

SELECT AID, COALESCE(DOB, ‘unknown’) FROM ACTOR

AID Name DOB

1 Jolie 4.6.75

2 Pitt 18.12.63

3 Depp null

ActorAID DOB

1 4.6.75

2 18.12.63

3 unknown’


Chapter 6SQL



• Aggregation


• ORDER BY



Sorting Output

89

• If query output is to be read by humans, enforcing a certain tuple order greatly helps in interpreting the result.

Without such an ordering, the sequence of output rows is meaningless, depends on the internal algorithms selected by the query optimizer to evaluate the query and may change from version to version of even query to query.

• In a SQL RDBMS, however, the query logic and the subsequent output formatting are completely independent processes.

DBMS front-ends offer a variety of formatting options (page breaks, colorization of column values, etc.).


Sorting Output

90

• Specify a list of sorting criteria in an ORDER BY clause.

An ORDER BY clause may specify multiple attribute names. The second attribute is used for tuple ordering if they agree on the first attribute, and so on (lexicographic ordering).

• The second sort criterion (Rating DESC) is used, when the first criterion (Year) leads to a tie.

‣ Sort in ascending order (default): ASC

‣ Sort in descending order: DESC

Return names and DOB of actors, and an ‘unknown’ value if DOB unspecified

SELECT Year, Rating, MID, TitleFROM MovieORDER BY Year, Rating DESC

Year Rating MID Title

1960 5 4 King Kong

2006 7 1 Babel

2006 4 5 King Kong

2008 3 3 Wanted

2009 8 2 Inglorious Bastards


Sorting Output

91

• In some application scenarios it is necessary to add columns to a table to obtain suitable sorting criteria.

Examples:

• In a list of Movies, ‘The Machinist’ should be listed under M, not T.

• Assuming actor names were stored in the form ‘Brad_Pitt’, sorting by last name is more or less impossible.

This is related to an important DB design time question: “What do I need to do with the query outputs?”


Sorting Output

92

• Null values are all listed first or all listed last in the sort sequence (IBM DB2: all first).

• Since the effect of ORDER BY is purely “cosmetic”, ORDER BY may not be applied to a subquery.

• This also applies if multiple queries are combined via UNION. Place ORDER BY at the bottom of the query to sort all tuples.


Chapter 6SQL



• Aggregation


• ORDER BY



Joins in SQL-92

94

• Up to version SQL-86, SQL users were not able to explicitly specify joins in queries. Instead, Cartesian products of relations (FROM) are specified and then filtered via WHERE.

• The query optimizer automatically derives the opportunity to deploy one of the several join variants (declarative query language).

Natural join of Role and Movie

SELECT * FROM Movie, RoleWHERE Movie.MID = Role.MID


Joins in SQL-92

95

• In SQL-92, users may write, e.g.,

• The keywords NATURAL JOIN lead the DBMS to automatically add the join predicate Movie.MID = Role.MID to the query.

• SQL-92 permits joins in the FROM clause as well as at the outermost query level (like UNION). This comes close to relational algebra.

“Natural join” in SQL-92

SELECT * FROM Role NATURAL JOIN Movie


Outer Join (recap)

96

•The left outer join preserves all tuples in its left argument, even if a tuple does not team up with a partner in the join:

•The right outer join preserves all tuples in its right argument:

•The full outer join preserves all tuples in both arguments:

A Ba1 b1

a2 b2

B Cb2 c2

b3 c3

A B Ca1 b1 (null)

a2 b2 c2⟕ =

A Ba1 b1

a2 b2

B Cb2 c2

b3 c3

A B Ca2 b2 c2

(null) b3 c3=⟖

A Ba1 b1

a2 b2

B Cb2 c2

b3 c3

A B Ca1 b1 (null)

a2 b2 c2

(null) b3 c3

=⟗


Outer Join in SQL-92

97

• All actors are present in the result of the left outer join.

• For actors without associated (joining) movies, columns AID, MID, and Role will contain NULL.

• COUNT(MID) ignores rows where MID IS NULL.

Number of movies per actor (if no movie, return 0)

SELECT A.AID, COUNT(MID) FROM Actor A LEFT OUTER JOIN Role R ON A.AID = R.AIDGROUP BY A.AID



98

Equivalent query without OUTER JOIN

SELECT A.AID, COUNT(MID) FROM Actor, Role RWHERE A.AID = R.AIDGROUP BY A.AIDUNION ALLSELECT A.AID, 0FROM Actor AWHERE A.AID NOT IN (SELECT AID FROM Role)



99

Is there a problem with the following query?

“Number of roles per movie (including 0) staring actor 1”

SELECT Title, COUNT(AID)FROM Movie M LEFT OUTER JOIN Role R ON M.MID = R.MIDWHERE R.AID = 1GROUP BY M.MID, Title



100

• It is generally wise to restrict the outer join inputs before the outer join is performed (or move restrictions into the ON clause).

Corrected version of last query

SELECT Title, COUNT(AID)FROM Movie M LEFT OUTER JOIN ( SELECT MID, AID FROM Role WHERE AID = 1) R ON M.MID = R.MIDGROUP BY M.MID, Title



101

Will tuples with years other than 2006 appear in the output?

SELECT M.Title, M.Year, R.NameFROM Movie M LEFT OUTER JOIN Role R ON M.MID = R.MID AND M.YEAR = ‘2006’

• Conditions filtering the left table make little sense in a left outer join predicate.

• The left outer join semantics will make the “filtered” tuples appear anyway (as join partners for unmatched Role tuples).



102

• MySQL (prior to V4.1), for example, does not support subqueries. Sometimes, outer join can make up for this.

• Instead of R.MID any attribute of Role which may not be null (e.g., a part of a key) could be tested for being NULL.

• The IS NULL test finds those Actor tuples which do not find a matching join partner.

Actors who did not star in any movie

SELECT A.AID, A.NameFROM Actor A LEFT OUTER JOIN Role R ON A.AID = R.AIDWHERE R.MID IS NULL


Join Syntax in SQL-92

103

• SQL-92 supports the following join types (parts in [ ] optional):

‣ [INNER] JOIN:$ Usual join.

‣ LEFT [OUTER] JOIN: Preserves rows of left table.

‣ RIGHT [OUTER] JOIN: Preserves rows of right table.

‣ FULL [OUTER] JOIN: Preserves rows of both tables.

‣ CROSS JOIN:$ Cartesian product.

‣ UNION JOIN:$ Pads columns of both tables with NULL (SQL-92 Intermediate level, rarely found implemented in DBMSs).

• Note: semi-join is easily expressed in the SELECT clause (SELECT DISTINCT R.* FROM R,S WHERE C).


Join Syntax in SQL-92

104

• The join predicate may be specified as follows:

‣ Keyword NATURAL prepended to join operator name.

‣ ON ⟨Condition⟩ appended to join operator name.

‣ USING (A1,...,An) appended to join operator name.

USING specified columns Ai appearing in both join inputs R and S. The effective join predicate then is R.A1 = S.A1 AND ··· AND R.An = S.An.

• CROSS JOIN and UNION JOIN have no join predicate.

Union Join not implemented in today’s DBMSs?

Simulate R UNION JOIN S


Notes

105


Summary

• SELECT-FROM-WHERE blocks, do not forget tuple variables

• Subqueries are possible, important keywords are NOT EXISTS, NOT IN, ALL, ANY

• Aggregate queries may use aggregate functions (AVG, SUM, COUNT, MIN, MAX), GROUP BY, and HAVING clauses.

• UNION and conditional expressions

• ORDER BY clause for sorting

• Joins in SQL-92

106

database systems i - uni-tuebingen.de

Documents