introduction to data science, week 2 notes...introduction to data science, week 2 notes 2013...

0

of 7

Danny King

[email protected]

www.dannyking.eu

Introduction to Data Science, Week 2 Notes

Danny King, 2013


Different types of database structure/organisation have benefits or weaknesses for different types of

activities (e.g. tradeoffs between read and write speeds). When choosing the data storage mechanism you

should keep in mind: how is the data organised on disk? What kinds of queries are best suited to the

organisation (and what aren’t)? How hard is it to update the data/add new data? What happens when

you encounter unanticipated queries – do you need to

reorganise the data and how hard is that?

The key part of relational databases is that the data is

independent of the application code – activities of users and

applications should remain unaffected when the internal

representation of data is changed. Tabular data has an

algebraic structure which allows reasoning and manipulation

independently of physical data representation. Programs you

write to manipulate data are more robust than they would be

without the relational model.

SQL is converted into relational algebra by relational DBMSs so

that they can use algebraic simplification to optimise the

equation and eliminate unnecessary tasks, making the computation more efficient, which is especially

important over large data sets. Algebraic closure in this context means that any operation (on a table) will

return a table so you can chain operations.

Sets can’t have duplicate elements and bags can.

Union bags Union sets

0

of 7

Danny King

[email protected]

www.dannyking.eu


Danny King, 2013

Intersection

e.g.:

Projection: remove all columns that aren’t explicitly listed and remove all duplicates that remain (for sets)

Bag semantics

Bag semantics

0

of 7

Danny King

[email protected]

www.dannyking.eu


Danny King, 2013

Cross product: all possible combinations of R1 with R2. Size of a cross product is |R1| x |R2|

Equijoin is a join with an equality condition. Join is: for every record in R1 find a corresponding record in R2

that satisfies some condition. Theta join is simply a join where the condition can be anything (so equijoin is a

special case of theta join where the condition is equality).

0

of 7

Danny King

[email protected]

www.dannyking.eu


Danny King, 2013

When dealing with understanding a

complicated query, look for the FROM

clause. In this case, there are 3 nested

FROM clauses (closure allows this –

operating on new tables returned by

each query).

You can save any of these statements

as a view (i.e. give it a name), then you

can refer to it as if it were a table.

Sometimes it helps to draw out complex

conditions graphically to work out what is

happening.

There are three types of user-defined functions

(UDFs).

Aggregate functions only appear in the select

clause and is always associated with a GROUP BY

– a common example is a concatenate function.

Table functions appear in the from clause –

usually the most complicated – commonly they

are used to generate sequence of integers (1, 2,

..., 4, 5) in a temporary table.

Scalar functions can appear pretty much

anywhere an attribute can – e.g. if in a SELECT

clause, a WHERE clause, in a JOIN condition, etc.

0

of 7

Danny King

[email protected]

www.dannyking.eu


Danny King, 2013

Declarative languages: you specify the answer you want but don’t specify anything about how to get it.

Relational Algebra expressions are not declarative (you do specify an order) but SQL is. With SQL you just

specify the properties that must be true of the result and you let the database figure out the right way to

compute it. The SQL is the ‘what’ not the ‘how.’

Logical Independence

Physical independence is the ability to isolate/protect the application from physical changes in the

organisation of the data. So if things were rearranged on disk, we don’t

have to rewrite the application. This is what databases provide.

Logical data independence is the ability to isolate/protect the

application form changes to the logical structure of the database –

e.g. adding a new column. If your application doesn’t care about that

column it shouldn’t have to be re-written. This logical data

independence is provided by views (all relational databases have this

concept).

This allows us to separate the people who are administering the

database from the ones who are actually accessing it.

A view is just a query with a name (saved in the

database) which you can access as if it were a

real table.

No matter how you want to organise your

tables views allows you to provide a different

perspective on the data (e.g. different

structure, different column names) for users or

other queries to use.

Access views exactly as if they were a table.

The database pulls all these together and works out the

best way to deal with them all in one go, so you don’t need

to worry about having a stack of queries or views – the

database will translate the whole thing into one big query.

0

of 7

Danny King

[email protected]

www.dannyking.eu


Danny King, 2013

Indexes allow you to find needles in the

haystack of database data.

Queries will always finish regardless of

database size as long as it fits on disk: you

don’t need to worry about amount of

memory(RAM).

Written Notes

0

of 7

Danny King

[email protected]

www.dannyking.eu


Danny King, 2013

introduction to data science, week 2 notes...introduction to data science, week 2 notes 2013...

Documents