databases - gitlab · an sql versus a nosql database system the short summary for this unit is...


Post on 27-May-2020


Page 1: Databases - GitLab · an SQL versus a NoSQL database system The short summary for this unit is going to be that you should choose an embedded database system, and you should choose

Databases

A dynamic web site needs permanent data, and a database is usually the place to store it

choices, embedded, nosql, sql, injection, scale

The choices

There are two main choices to make:

an embedded versus a served database system
an SQL versus a NoSQL database system

The short summary for this unit is going to be that you should choose an embedded database system, and you should choose an SQL database system

Don't just follow fashion, and beware that textbooks copy each other and are full of incredibly obsolete garbage

Fashion

Most under-educated amateurs choose database systems according to fashion, i.e. they follow the crowd and copy stuff out of tutorials without knowing why it works or whether it is the right choice

As ever, tutorials show you how but don't tell you why

As with all issues in this course, you should try to understand more deeply, and make informed choices

Embedded versus served

A database is stored in a file (or sometimes a set of files)

Only one process/thread should access the file (SQLite is an exception)

In embedded databases, this process is the web server, and in served databases, it is a separate database server

SQLite sharing

SQLite is rare in allowing multiple processes/threads to access the same file

It uses whole-file locking, which is limited in its efficiency in highly concurrent situations, so you probably shouldn't use this feature

It is extremely unfair to criticize SQLite for this

Choosing embedded/served

An embedded database is simpler, easier to run, and more efficient (no extra network overhead for transporting data)

A served database allows multiple web server processes/threads to share the same database

you should choose embedded for easy marking
upgrading to a served database to scale up is easy
served databases aren't always best for scaling up

Maintenance is an issue, and scaling up is complex - see later

Embedded is simpler

A served database provides a complex protocol for handling queries over a network connection

This protocol is not standardized - it is different for each database system

On top of that, there is usually a system for caching the data in the web server, so there is complexity at both ends

Embedded is easier

With a served database, you have to manage two servers (web and db), and worry about the configuration, especially security, of both servers

In a commercial setting, there are extra operational problems (what happens if one server goes down, or if the web server comes up before the database server is ready to accept queries)

For marking, I am not prepared to run two servers to mark each submitted web site

Embedded is faster

With a served database, each network request to the web server may result in another network request to the database server

This can effectively 'double' the network load, and network load is often the efficiency bottleneck

When the two servers are running on the same computer, there should be no actual network traffic, but there are still overheads with using the query protocol

Maintenance

With an embedded database, you can't normally use a command line tool to update the database while the server is running (with a served database, the tool can connect to the running server)

That's because two processes can't access the same database file

Either you have to shut down the server, or you can provide a privileged web page for data maintenance on your site

NoSQL databases

NoSQL databases are simpler than SQL databases, so let's deal with them first

They are special purpose databases

There are different types for different special purposes

The most common type is the 'document store'

MongoDB, the current fashion in JavaScript, is a document store

object oriented databases

Object oriented databases

The idea of an object oriented database is that it lives in memory as a collection of (normal) objects

There is no overhead in fetching the data or converting it into object format

It is stored on disk like a write-through cache to make the data permanent (so it survives a server restart)

NoSQL database systems are a long way from being OO database systems, and OO database systems are not very popular or successful anyway - see next few slides

NoSQL is not OO

NoSQL databases are not OO because they lack several features of conventional SQL databases which are well known, and which need to be implemented in memory to form a proper database

These are concurrency (thread-safety and thread-efficiency), transactions (ACID consistency), and indexes (to speed up searches)

Roughly speaking, these require versioning (for read consistency) and locking (for write efficiency)

Why not OO databases?

Why are OO database systems unpopular and unsuccessful?

One problem is that OO databases don't seem to fit well with refactoring

In classed languages like Java, if you update the classes that the database objects belong to, the data becomes invalid, so needs to be massaged before it can be used again

Also, OO databases can only be used by programmers; they make no sense to non-experts

Future of OO databases?

In my opinion, OO databases make a great deal of sense, and deserve to be re-thought

One thing that needs doing is to build OO database support into the programming language (Java's JPA system, bolted on from the outside, is incredibly complex)

Another is that databases need to be included in the program development process, so that refactoring of the data is done naturally alongside refactoring of the code

When does NoSQL work?

NoSQL databases of the document store type work best when the data consists of separate objects which may have internal structure, but which have no relationships or cross-references between them, and no important consistency requirements

For example, if you set up a shopping site, and your data objects are customers and products, everything is fine

But as soon as you add sales, you are in trouble, because a sale object references a customer and a product

Eventual consistency

MongoDB is popular because it is simple, so it works well for under-educated amateurs who don't know or care about relationships or consistency

It can also work well for experts in very large distributed projects if the data is suitable and if they realize that they (and not the database system) are responsible for the implementation of consistency issues

These large projects require eventual consistency, the hardest kind to implement

Advice

In my opinion it is usually a mistake to start with a NoSQL database

That's because, by the time you realize that it is inadequate for your kind of data, it will probably be exceedingly difficult to upgrade to an SQL database

NoSQL horror stories

Here are some things that go wrong with NoSQL:

relations (TV shows)
transactions (bitcoins)
resources
queries
security

You should not use MongoDB in this unit (for ease of marking, you should use an embedded system)

If you must use the MongoDB API (highly non-recommended), then use something like TingoDB (an embedded version)

TV shows

In one famous case (a minor example within this story), MongoDB was used to set up a system for accessing TV shows online

MongoDB worked well, because each episode was a self-contained object, and there were no cross-references

Then, one day, an extra feature was requested - to be able to click on an actor listed in an episode, and see all the other TV shows the actor was in

See the next two slides for more details

The data

First, there was a problem with the mass of data already collected

Each actor was listed in different TV episodes using slightly different strings - extra spaces, or full stops, or initials versus first names, or ...

So how do you know when it is the same actor?

String comparison is not good enough

A lot of effort had to be put into cleaning up the data, giving each actor a unique id
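To see why string comparison is not good enough, here is a minimal sketch. The actor names and the normalise function are made up for illustration, and are not taken from the real case. Simple clean-up deals with case, extra spaces and full stops, but an initial still looks like a different actor, which is why each actor ends up needing a unique id assigned with some human judgement.

```javascript
"use strict";

// Hypothetical clean-up: lower-case, drop full stops, collapse spaces.
function normalise(name) {
    return name.toLowerCase().replace(/\./g, "").replace(/\s+/g, " ").trim();
}

// Invented credits, as they might appear in four different episodes.
const credits = ["Pat Smith", "pat  smith", "Pat Smith.", "P. Smith"];

// Give each distinct cleaned-up name the next free actor id.
const ids = new Map();
for (const credit of credits) {
    const key = normalise(credit);
    if (!ids.has(key)) ids.set(key, ids.size + 1);
    console.log(JSON.stringify(credit), "-> actor", ids.get(key));
}
```

The first three credits share an id, but "P. Smith" gets a new one, even though a person reading the list would probably guess it is the same actor.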

The outcome

As well as cleaning up the data, the relationships between the actors and the episodes had to be set up

Although there are lots of claims about being able to represent relationships in MongoDB, it isn't usually viable

So with a lot more effort the data was moved to an SQL database, where it should have been in the first place

Bitcoins

In another case, a bitcoin site was set up

It is rather like a bank, with accounts, where a transaction consists of taking some money out of one account and putting it into another

A system was set up using MongoDB, and then it was discovered that the data was becoming inconsistent, i.e. money was evaporating or being generated out of nowhere

See next slide for more details

Transactions

The point is that the transfer of money between accounts involves updating two objects

And the pair of updates together must act as an indivisible transaction, i.e. both updates must work or neither

At the time, MongoDB was completely incapable of making such a guarantee across two objects (multi-document transactions only arrived in MongoDB 4.0) - the system had to be switched to an SQL database
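As a rough illustration of "both updates must work or neither", here is an in-memory sketch. The account names and the transfer function are hypothetical, and this is not how a database system implements transactions (a real one also has to cope with concurrency and crashes). In SQL, the two updates would be wrapped between begin and commit, with rollback on failure.

```javascript
"use strict";

// Hypothetical in-memory accounts, standing in for rows in a database.
const accounts = { alice: 100, bob: 20 };

// Move money as one indivisible step: if anything fails part-way,
// restore the saved balances so that neither update survives.
function transfer(from, to, amount) {
    if (!(from in accounts)) throw new Error("no such account: " + from);
    const saved = { ...accounts };            // remember the old balances
    try {
        accounts[from] -= amount;             // first update
        if (accounts[from] < 0) throw new Error("insufficient funds");
        if (!(to in accounts)) throw new Error("no such account: " + to);
        accounts[to] += amount;               // second update
    } catch (e) {
        Object.assign(accounts, saved);       // roll back the first update
        throw e;
    }
}

transfer("alice", "bob", 30);
console.log(accounts);                        // { alice: 70, bob: 50 }

// This transfer fails after the first update, so it must be undone.
try { transfer("alice", "carol", 10); } catch (e) { console.log(e.message); }
console.log(accounts);                        // still { alice: 70, bob: 50 }
```

Without the rollback in the catch block, the failed transfer would leave alice 10 short with nobody receiving it, which is exactly the "money evaporating" symptom.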

Resources

Some people are horrified to discover that the files that MongoDB creates are huge, or that the amount of memory it reserves when you start it up is huge

This is even if the MongoDB database is empty

It is because MongoDB is optimized for big data

It is a solvable problem, but why should you have to solve it?

Queries

Many people find they have a need to make complex queries on MongoDB objects

So, they add a module which allows them to make SQL queries on MongoDB objects, without realizing how utterly stupid this is - MongoDB is a NoSQL database - the clue is in the name

What they are doing is handling relationships with MongoDB in a completely implicit and non-robust way - relationships require consistency guarantees

Security

It is very common to create an insecure system using MongoDB

See survey

Or see study

This is a solvable problem, in fact the MongoDB suppliers have been persuaded to improve the defaults, but the fundamental problem is that people generally don't know or care about security - a professional must take care over security

SQL databases

SQL databases are general purpose, and they are all essentially the same

They are also called relational databases because they fit the theory of relations (roughly)

It takes a little knowledge and skill to (a) fit your data to the relational model, and (b) use the SQL language to extract data

But it is not rocket science (and if your data is trivial enough for MongoDB to be suitable, then the SQL approach is easy too!)

Criticisms

SQL is not as relational as it should be - see criticism and the third manifesto

It produces results in implementation-dependent order, the results may include duplicates, problems arise with limit clauses etc. which depend on order, there is a three-value logic based on null which is incomplete and doesn't work, and so on

See notes on purifying it

Pet store

A simple pet store database might have three tables, for animals, customers, and sales:

    animals
    id   breed
    42   dog
    53   fish

    customers
    username   name
    ab12345    Pat
    cd67890    Chris

    sales
    id   username   price
    42   ab12345    100
    53   cd67890    50

Relationships

Each row of a table is an object, each column is a field

The animals and customers are simple because they have no relationships

The sales table is more subtle, because it uses ids and usernames to represent relationships (cross-references) between the sales table and the other two tables, to avoid duplicating any information
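A cross-reference is only useful if it points at something. As a rough sketch, with the pet store rows written as JavaScript objects and a hypothetical helper called danglingSales, this is the kind of check involved. An SQL database with foreign key support can do this checking automatically.

```javascript
"use strict";

// The pet store tables, as plain arrays of row objects.
const animals = [{ id: 42, breed: "dog" }, { id: 53, breed: "fish" }];
const customers = [
    { username: "ab12345", name: "Pat" },
    { username: "cd67890", name: "Chris" },
];
const sales = [
    { id: 42, username: "ab12345", price: 100 },
    { id: 53, username: "cd67890", price: 50 },
];

// Hypothetical check: list the sales whose id or username points at nothing.
function danglingSales(sales, animals, customers) {
    const ids = new Set(animals.map(a => a.id));
    const usernames = new Set(customers.map(c => c.username));
    return sales.filter(s => !ids.has(s.id) || !usernames.has(s.username));
}

console.log(danglingSales(sales, animals, customers));   // []

// A sale of animal 99, which doesn't exist, is reported as inconsistent.
const bad = sales.concat([{ id: 99, username: "ab12345", price: 5 }]);
console.log(danglingSales(bad, animals, customers));
```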

SQL

A statement in SQL to fetch all the animals objects is:

    select * from animals

The result, in JavaScript, is a set of objects, where each object has fields id and breed

SQL is designed to be programming-language-neutral, i.e. it can be used inside C, C++, Java, JavaScript or any other language, object oriented or not

It is also designed to be used as an interactive command language, so that databases can be handled manually, outside of the web server

The SQL language

SQL is a narrow-purpose declarative scripting language for searching and sorting database records

SQL is supposed to be standardised across database systems, but (a) the standard is inaccessible and old and woolly and (b) database systems don't always stick to it

The result is that SQL is reasonably standard as long as it is a human who is moving from system to system, but if you move actual SQL statements from system to system, you should expect to have to adjust them

SQLite

The recommended system for this unit is SQLite (pronounced s-q-l-lite or sequelite)

If the sqlite3 node module is installed, the callback style has to be used, so these notes assume that the sqlite node module is installed, which uses promises so that the async/await keywords can be used

command line tools, why lite?, documentation, other SQL systems
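The difference between the two styles can be sketched without a database. Here allRows is a hypothetical stand-in for a callback-style query function, and util.promisify from the Node standard library wraps it so that await can be used. This only shows the idea, it is not a description of how the sqlite module is actually built.

```javascript
"use strict";
const util = require("util");

// Hypothetical callback-style query function standing in for a database call.
function allRows(query, callback) {
    setImmediate(() => callback(null, [{ id: 42, breed: "dog" }]));
}

// Callback style: the result arrives in a function passed as the last argument.
allRows("select * from animals", (err, rows) => {
    if (err) throw err;
    console.log("callback:", rows);
});

// Promise style: wrap the same function so that await can be used.
const allRowsAsync = util.promisify(allRows);

async function main() {
    const rows = await allRowsAsync("select * from animals");
    console.log("promise:", rows);
    return rows;
}

const result = main();
```

With await, the code reads top to bottom and errors can be caught with an ordinary try/catch, which is why these notes prefer the promise style.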

Command line tools

Instead of accessing the database via a program with SQL strings in it, you can install a command line tool which allows you to type SQL commands interactively

You could use the sqlite3 command (not the node module) or SQLite Browser or SQuirrel (which accesses any type of database)

This can be useful to maintain the database, e.g. initialize it, but you can't safely access the database this way while the web server is using it, so the usefulness is limited

Why lite?

What does the 'lite' in SQLite refer to?

It means that SQLite is small and simple, suitable for small devices such as mobiles, so what do you lose?

It is embedded and not served, it has loose types in the JavaScript style, some features have to be 'switched on', and you have to think hard about scaling up

But, contrary to rumour, it doesn't compromise on concurrency (though Node is single-threaded anyway)

SQLite documentation

Most database systems are written in C or C++, including SQLite (and MongoDB)

So, for SQLite, there are two parts to the documentation, one not language specific, one JavaScript

General SQLite documentation

Node sqlite3 module documentation

Other SQL systems

SQL systems are all very similar, and the SQL language is portable (with a few changes) between systems

Desirable features of an SQL database system are zero configuration and robustness

See the next few slides for details

Configuration

Configuration doesn't mean things like adding an index to a table for efficiency - any database system needs that - it means things like creating data files, choosing file sizes, choosing limits, choosing text encodings, switching on foreign key support, upgrading, ...

Configuring databases can be a nightmare - because of complexity and/or poor documentation

Good modern designs just work, because all the configuration is automatic

Robustness

1) data must not be corrupted by a program crash, operating system crash, or power failure - use a database system which has had extensive crash testing

2) transactions should be serializable (any other isolation level, such as read committed, is a fudge)

3) there should be automatic consistency checking, e.g. for foreign keys

Most common database systems are OK for robustness

Stable storage

Beware that no database system can guarantee robustness unless running on stable storage

And normal stable storage (RAID) isn't perfect because (for efficiency) the hardware confirms that data has been written when it hasn't - it is still in a hardware cache waiting to be written

True robustness only comes from inefficiently configured specialist hardware

Opinions

Based on experience, I would say:

Oracle: incredibly reliable, incredibly expensive and obsolete, requires a training course to configure

MySQL: incredibly popular, too proprietary, configuration and documentation very poor

MariaDB: newer open version of MySQL, better but configuration still poor

PostgreSQL: OK

Derby: excellent all round, but effectively Java only

Creating a database

Here is a program to create a database file:

    "use strict";
    // Assumes version 3 of the sqlite module, where open() takes a file path
    var sqlite = require("sqlite");
    create();

    async function create() {
        try {
            var db = await sqlite.open("./db.sqlite");
            await db.run("create table animals (id, breed)");
            await db.run("insert into animals values (42,'dog')");
            await db.run("insert into animals values (53,'fish')");
            // Read the rows back to check that the inserts worked
            var as = await db.all("select * from animals");
            console.log(as);
        } catch (e) { console.log(e); }
    }

interactive, database files, semicolons, sample data, meta commands, scripts, callback style

Interactive creation

Using the sqlite3 command:

    > sqlite3 data.db
    > create table animals (id, breed);
    > insert into animals values (42, 'dog');
    > insert into animals values (53, 'fish');
    > select * from animals;
    > .quit

Database files

When you open data.db, that says you want to use the file data.db to hold the data

SQLite will create the file the first time, if it doesn't exist

If it does exist, SQLite will connect to it and give you access to all the data stored permanently in it

The file can be copied, renamed or moved as you like

Semicolons

When you type SQL commands interactively, you need semicolons

Leaving out a semicolon indicates that you want to continue the command on the next line

When you use SQL commands in a program, you don't need semicolons because each command is a self-contained string

Sample data

The two insert commands are just there to insert example data to try out SQL

You can delete the sample data with delete from animals; if you want to, before using the database in your server

Meta command

A meta command is a command provided for interactive use which goes beyond the SQL language

For example .exit is clearly never needed in a program

Other useful meta commands are .help, .tables, .headers on

Meta commands don't need semicolons at the end

The SQL command pragma foreign_keys = on; is useful for enabling automatic consistency checks

Scripts

A script is a file containing SQL commands, e.g.

create.sql

    create table animals (id, breed);
    insert into animals values (42, 'dog');
    insert into animals values (53, 'fish');
    select * from animals;

This allows you to remember, edit and redo the commands later, if you feel the need to start again from scratch

You can redo by copying and pasting the commands, or with .read create.sql

Callback style

If you use the sqlite3 module, the create program is:

create.js

    "use strict";
    var sql = require("sqlite3");
    var db = new sql.Database("data.db");
    db.serialize(create);

    function create() {
        db.run("create table animals (id, breed)");
        db.run("insert into animals values (42,'dog')");
        db.run("insert into animals values (53,'fish')");
    }

See next slide for serialize

Serialize method

    db.serialize(create);

    function create() {
        db.run("...");
        db.run("...");
        ...
    }

The serialize method calls a function in sequential mode; the calls to run are chained, as if each had a callback function which triggered the next

serialize returns straight away, with the queries still outstanding, but automatically linked by callbacks so they happen in order
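The chaining idea can be sketched without the sqlite3 module. In this sketch, run is a hypothetical stand-in for db.run which finishes later and then calls back, and this serialize takes a list of labels rather than a function, so it is only a model of the behaviour and not the module's real API or implementation.

```javascript
"use strict";

const log = [];

// Hypothetical stand-in for db.run: finishes later, then calls back.
function run(label, callback) {
    setTimeout(() => { log.push(label); callback(); }, 0);
}

// Sketch of sequential mode: each task is started by the callback of the
// one before, and serialize itself returns before any of them has finished.
function serialize(tasks, done) {
    let i = 0;
    function next() {
        if (i === tasks.length) return done();
        const task = tasks[i++];
        run(task, next);
    }
    next();
}

const finished = new Promise(resolve => {
    serialize(["create", "insert dog", "insert fish"], resolve);
});
log.push("serialize returned");

// Prints "serialize returned" first, then the three tasks in order.
finished.then(() => console.log(log));
```

The point to notice is that "serialize returned" is logged before any of the tasks, yet the tasks themselves still happen in the order they were given.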

Fetching rows

fetch.js

    "use strict";
    var sqlite = require("sqlite");
    fetch();

    async function fetch() {
        try {
            var db = await sqlite.open("./db.sqlite");
            var as = await db.all("select * from animals");
            console.log(as);
        } catch (e) { console.log(e); }
    }

The output is:

    [ { id: 42, breed: 'dog' }, { id: 53, breed: 'fish' } ]

closing, callback style

Closing a database

Tutorials often show a call db.close() to close the connection to the database

Don't do this in a server; open the database once at the start, and keep it open for every request

You would only use db.close() if you have implemented a clean way to shut down the server; but most people just use CTRL/C (which is OK, because the system does frequent commits)

Fetching rows

fetch.js

    "use strict";
    var sql = require("sqlite3");
    var db = new sql.Database("./db.sqlite");
    db.all("select * from animals", show);

    function show(err, rows) {
        if (err) throw err;
        console.log(rows);
    }

The output is:

    [ { id: 42, breed: 'dog' }, { id: 53, breed: 'fish' } ]

Fetching a single row

single.js

    "use strict";
    var sqlite = require("sqlite");
    fetch();

    async function fetch() {
        try {
            var db = await sqlite.open("./db.sqlite");
            var as = await db.get("select * from animals where id=42");
            console.log(as);
        } catch (e) { console.log(e); }
    }

The output is:

    { id: 42, breed: 'dog' }

If you only want one object, you can use get instead of all (to avoid a list with one entry)

callback style

Callback style

single.js

    "use strict";
    var sql = require("sqlite3");
    var db = new sql.Database("./db.sqlite");
    db.get("select * from animals where id=42", show);

    function show(err, row) {
        if (err) throw err;
        console.log(row);
    }

The output is:

    { id: 42, breed: 'dog' }

Page 57: Databases - GitLab · an SQL versus a NoSQL database system The short summary for this unit is going to be that you should choose an embedded database system, and you should choose

js

js

Inserting, updating"use strict"; var sqlite = require("sqlite"); insert(); async function insert() { var db = await sqlite.open("./db.sqlite"); await db.run("insert into animals values (64,'cat')"); }

"use strict"; var sqlite = require("sqlite"); update(); async function update() { var db = await sqlite.open("./db.sqlite"); await db.run("update animals set breed='terrier' where id=42"); }


Callback style

    "use strict";
    var sql = require("sqlite3");
    var db = new sql.Database("data.db");
    db.run("insert into animals values (64,'cat')");

    "use strict";
    var sql = require("sqlite3");
    var db = new sql.Database("data.db");
    db.run("update animals set breed='terrier' where id=42");


Joins

A statement in SQL to load up all the sales, together with the data about each animal and customer, is:

    select * from sales
    join animals using (id)
    join customers using (username)

The column name id must match in sales and animals, and similarly for username

The result, in JavaScript, is a set of objects, where each object has fields price, id, breed, username, name


Natural joins

An alternative is

    select * from sales
    natural join animals
    natural join customers

This automatically finds the matching column names - so you have to be very careful about naming the columns correctly

In this case id is the only matching column in sales and animals, and the same with username in sales and customers


Alternatives

Other alternatives which are more explicit about column names are:

    select * from sales
    join animals on animals.id = sales.id
    join customers on customers.username = sales.username

    select * from sales, animals, customers
    where animals.id = sales.id
    and customers.username = sales.username

    select * from sales as s, animals as a, customers as c
    where a.id = s.id
    and c.username = s.username


ORMs

If you really don't like SQL, you can use an ORM (Object-Relational Mapping) such as sequelize

This allows you to access the data as objects, instead of using SQL

But I don't really recommend this approach, because (a) it doesn't give you the full power of SQL, only a simplified subset and (b) if you are only going to need a simplified subset, the SQL shouldn't be that difficult


SQL Injection

SQL injection = SQL poisoning means attacking a site as a hacker, using clever text to pervert SQL queries

Suppose you join strings to build an SQL query:

    a = url.query.address;
    q = "update people set address='" + a + "' where ...";

Suppose an attacker types this address into your form:

anywhere', auth='admin

They could give themselves admin privileges
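To see why, it helps to print the query text that the database would actually receive. The following is a minimal sketch, not the site's real code: the function name buildQuery and the clause where id=7 are made up for illustration, since the slide leaves the where clause elided.

```javascript
"use strict";

// Build the query the unsafe way, by joining strings as on the slide.
// The "where id=7" clause is a hypothetical placeholder.
function buildQuery(address) {
    return "update people set address='" + address + "' where id=7";
}

// An honest user and an attacker fill in the same form field.
var honest = buildQuery("1 High Street");
var attack = buildQuery("anywhere', auth='admin");

console.log(honest);
// update people set address='1 High Street' where id=7
console.log(attack);
// update people set address='anywhere', auth='admin' where id=7
```

The attacker's quote character closes the address string early, so the rest of the input is read as SQL and a second column gets assigned.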


Prepared statements

The best way to guard against SQL injection, and to make your queries more efficient, is to use prepared statements. Instead of db.run, do this:

    var ps = await db.prepare("update animals set breed=? where id=?");
    await ps.run("terrier", 42);
    await ps.finalize();

The first line can be done once, when the server starts up, and the third line can be done once when the server shuts down (if you have a clean shutdown)

Note: this needs to be done for all query types, including select


Keys

A primary key is an id which identifies a row/record, e.g. student username or unit code

    create table pets (name text, kind text, primary key (name))

No two records can have the same primary key (and keys can't be null)

A secondary key is a column (or group) which is kept unique by the database, e.g. student candidate number


Foreign keys

There are no pointers, references are via keys

A foreign key is a column (or group) in one table which references a primary key in another:

    id   breed
    42   dog
    53   fish

    id   username
    42   ab12345
    53   cd67890

Foreign key consistency means the foreign key column always refers to an existing primary key


Setting up foreign keys

    create table animals (id, ...
        primary key (id)
    );
    create table sales (id, username, ...
        foreign key (id) references animals(id)
    );
    pragma foreign_keys = on;


Blobs and Clobs

An SQL database system usually has the ability to support Blobs (Binary large objects) and Clobs (Character large objects)

I suggest not using these features

For example, with photos, it is usually better to store each photo as a file, and put the filename into the database

Generally, the database should hold the data which you use to make decisions, not structureless data


Concurrency

A server is highly concurrent, able to handle many simultaneous requests

A node server handles concurrency in a particular way

In the callback style, each request is divided into function calls, each ending with an I/O operation with a callback

The function calls from separate requests can be interleaved, but two function calls are never executed at the same time, and each function call is never interrupted


Concurrency example

A server which does template-filling might look like:

    function getTemplate(file, response) {
        fs.readFile(file, "utf8", ready);
        function ready(err, text) { getData(text, response); }
    }

    function getData(text, response) {
        db.get("select ...", ready);
        function ready(err, row) { fillTemplate(text, row, response); }
    }

    function fillTemplate(text, row, response) {
        // put row data into text and deliver response
    }


Example continued

Imagine two requests A and B arriving almost together

The sequence of function calls might be:

    A getTemplate
    B getTemplate
    A getData
    B getData
    A fillTemplate
    B fillTemplate

Any interleaved sequence is possible, depending on timings, and the server code must handle any of them, i.e. it must be thread-safe
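The interleaving can be watched in a small simulation. This is only a sketch under stated assumptions: io is a made-up stand-in for a file read or database query (it just waits), handle plays the part of the three function calls for one request, and the delays of 20 and 30 milliseconds are arbitrary.

```javascript
"use strict";

// Hypothetical stand-in for an I/O operation (file read, database query):
// it simply completes after the given number of milliseconds.
function io(ms) {
    return new Promise(function (resolve) { setTimeout(resolve, ms); });
}

// One request: three function calls, separated by I/O.
async function handle(name, delay, log) {
    log.push(name + " getTemplate");
    await io(delay);
    log.push(name + " getData");
    await io(delay);
    log.push(name + " fillTemplate");
}

// Two requests arriving almost together, recording calls in one shared log.
async function demo() {
    var log = [];
    await Promise.all([handle("A", 20, log), handle("B", 30, log)]);
    return log;
}

var finished = demo();
finished.then(function (log) { console.log(log.join("\n")); });
```

Each request's own calls always appear in order, but the calls of A and B are mixed together, and changing the delays changes the mixture.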


Global variables

Global variables in the server, shared by all the functions, are not generally thread-safe

    var sqlite = require("sqlite");
    var db;

The sqlite variable is safe because it is constant

The db variable is safe because functions are never interrupted, so a shared database connection is OK (but not if you have multi-query transactions!)

In environments other than Node, where interruptions may happen, each simultaneous request needs its own database connection (taken from a pool)


Example: DIY auto-ids

If you want a table where each row has an auto-generated id, you might try a global variable:

    var nextId = 1;

    async function insert(...) {
        var query = "insert into ... values (id=?, ...";
        var ps = await db.prepare(query);
        await ps.run(nextId);
        nextId++;
    }

The global variable nextId would be OK if it was incremented in the same function call as using it, but technically this is two function calls separated by await, so it is not safe
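The failure is easy to reproduce without a real database. In this sketch, fakeRun is a hypothetical stand-in for ps.run: it records the id it was given after a short delay, the way real I/O would complete later.

```javascript
"use strict";

// Hypothetical stand-in for ps.run: records the id it was given, but
// only after a short asynchronous delay, like a real database write.
var inserted = [];
function fakeRun(id) {
    return new Promise(function (resolve) {
        setTimeout(function () { inserted.push(id); resolve(); }, 10);
    });
}

var nextId = 1;

// Same shape as the slide: use nextId, await, then increment.
async function insert() {
    await fakeRun(nextId);
    nextId++;
}

// Two requests arrive together: both read nextId before either increments.
async function demo() {
    await Promise.all([insert(), insert()]);
    return inserted;
}

var finished = demo();
finished.then(function (ids) { console.log(ids); });
// [ 1, 1 ]
```

Both rows get id 1, and nextId then jumps to 3, so id 2 is never used at all.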


Example: DB auto-ids

Suppose instead you get the database to do it

In sqlite, that's by defining an integer primary key column id, say

You only add the autoincrement keyword (which reduces efficiency) if you want to forbid the reuse of old id numbers

Then you do insert queries without providing the id column


Example continued

And suppose, just after the insert, you want to know what the id number is:

    async function insert(...) {
        await db.run("insert ...");
        var id = await db.get("select last_insert_rowid() as id ...");
    }

This is not safe because another request could insert a row after the run, before the get


Example solved

So what is the best way to sort out auto-ids?

Answer: scour the database documentation to find out how to insert-and-fetch-id as an indivisible operation:

    async function insert(...) {
        var x = await db.run("insert ...");
        var id = x.lastID;
    }


In-memory data

Suppose you want data available instantly as objects in memory

This is more than just a caching system inside the database, because it eliminates translation between data formats

In the JavaScript world, systems like MongoDB can do this, but you lose consistency guarantees in the process

You need a proper OODB, but there aren't any good ones


Scale

Scaling up is difficult

The most important thing is overall design

It is difficult to give good advice, because every project has different requirements, so let's just look at a few issues in the context of some case studies


Facebook

Facebook is huge, so how do they organize their DB?

They use many different database systems for different purposes, but they mostly use SQL - their most important data is relationships(!), so only SQL will do

They have one giant central database, and local copies

When you update, the central DB is updated, and your browser is sent to the central database for 20 seconds, to give time (they hope!) for your update to reach your nearest local copy


Yahoo

Yahoo use the PNUTS system, which works the other way round

Your data is updated in your local database, which provides consistency guarantees for local data

Then the data is shared between the local databases to form a virtual global database

In that virtual database, consistency isn't guaranteed, but inconsistencies are extremely unlikely to matter


SAFE

The SAFE system in the faculty was initially designed and set up by me, with a lot of help from Martin Baker and later other people, though I no longer have any responsibility for it

Previously, the computer science web site (when it existed) did all that SAFE does and more, and SAFE was the result of a challenge to do something similar at the faculty level

SAFE is hideously complex (too complex, really) - it does all sorts of behind-the-scenes stuff

It is being retired at the same time as me!


Distributed or not?

Perhaps the biggest question we faced was whether SAFE would run on one computer, or several for robustness or load balancing or database size reasons

We looked into the robustness issue, and our conclusion was that multiple servers would make the system more complex and less robust

(True robustness is only possible with special and expensive hardware, and a full time person to run the system - as 'amateurs' we couldn't afford either)


Robustness

Although we couldn't use multiple servers for robustness, we needed to do something

We opted for (a) keeping the database continually and safely backed up and (b) having a backup computer ready to take over

It is a cold backup site, not a hot one, because we can't afford the hardware or staff


Load Balancing

Did we need multiple servers for load balancing?

Load balancing is used, for example, in the university's Blackboard system

But the lack of responsiveness of Blackboard is mainly due to the depth of each request (the number of layers of software and networking it goes through to get down to the data) and load balancing doesn't help there

Multiple servers, and the probably necessary networked access to the database that results, may make that worse


Load

Assuming that network load is the efficiency bottleneck, could a single server cope with the load?

On the computer science system, the maximum load we measured was a quarter of a million hits in a day, i.e. roughly 3 per second averaged over 24 hours, or about 5 per second if the hits are concentrated into around 14 hours

We would expect SAFE to peak at around 25 hits per second

Given the university's good infrastructure, that is feasible on a single server


Database size

That leaves database size as the only likely reason to need multiple servers

Distributed databases are where NoSQL databases are supposed to come into their own, and it is why they are advertised as 'scalable'

But there is a problem known as the CAP theorem, which says that a distributed database cannot guarantee both consistency and availability when the network between its parts fails, and which means essentially that NoSQL databases cannot provide consistency


Kind of data

NoSQL databases are brilliantly scalable if your data is suitable

They are scalable precisely because they give no consistency guarantees

The type of data stored in SAFE (students, programmes, units, assignments, marks, ...) has huge amounts of cross-referencing and huge consistency requirements

A NoSQL database could not possibly work


Local versus shared

But the data held in the university as a whole is too big to avoid distributed data

So, the university has a strategy: divide the data into coherent local SQL databases, with each piece of data only being updated within its relevant database, and with consistency guarantees within each database

The data which must be shared is duplicated in every database by overnight copying (which works very well, though some data may be up to a day out of date)


Requirements for SAFE

SAFE is almost entirely responsible for day-to-day data concerning our faculty's students and units

But it needs central info about students from other faculties taking our units, and units from other faculties taken by our students

And it needs to report some of our faculty's data for central storage, to be shared by other faculties

This is all done by automatic overnight processes


Cross-database queries

Previously, we had done some cross-database queries in order to fetch data overnight

An example might be "fetch the records from a central table which are relevant to the students listed in our local table"

These queries only work when both databases use the same system, and we wanted to use a different database system for SAFE


Experiment

To check the feasibility of switching to a different database system, we did an experiment

We took our biggest cross-database query and replaced it with a query on the remote database alone of the form "fetch all the data from the central table", and we wrote code to check each record as it arrived, discard it if not relevant, and update our own database if relevant

To our surprise, this was 100 times more efficient

Cross-database queries are an exceedingly crude mechanism, of use only to non-programmers
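The shape of the replacement code can be sketched in a few lines. This is not the SAFE code (which was written in Java); the usernames, unit names and the function name relevant are all made up, and the rows are held in an array rather than arriving from a remote database.

```javascript
"use strict";

// Hypothetical sample data: usernames in our local table, and rows as
// they might arrive from "fetch all the data from the central table".
var localStudents = new Set(["ab12345", "cd67890"]);
var centralRows = [
    { username: "ab12345", unit: "U1" },
    { username: "zz99999", unit: "U1" },
    { username: "cd67890", unit: "U2" }
];

// Check each record as it arrives: keep it if it belongs to one of our
// students, otherwise discard it. The Set makes each check a hash lookup.
function relevant(rows, wanted) {
    return rows.filter(function (row) { return wanted.has(row.username); });
}

var kept = relevant(centralRows, localStudents);
console.log(kept);
```

The kept rows would then be used to update the local database, one prepared statement execution per row.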


Database size

The SAFE database is exceedingly complex (hundreds of tables with hundreds of cross-reference fields) and large (hundreds of thousands of records in the largest tables)

But, by moving all the bulk unstructured data out into files, the total size is about a gigabyte, which easily fits in one server's memory, let alone one server's disk

So, the final design decision was: use a single server with a single SQL database


Embedded or served?

Our previous experience had been with served databases, because that's what the university uses (it has a site-wide Oracle licence)

But we knew from experience that the network is the efficiency bottleneck, and that using a served database can roughly double the network load

So one of the main reasons for wanting to stick to a single server and database was so that the database could be embedded


Queries

The SAFE database is a Derby one, within a Java server

How efficient are queries?

If you use prepared statements (as you always should) the first time you execute a query, it is slow, and after that, it is blindingly fast

That's because Derby compiles the query to Java bytecode, and then Java's JIT compiler compiles it to machine code!


Query bottleneck

But we knew from experience that database queries would probably form a second efficiency bottleneck

So, we planned to do better

On a system like SAFE, the number of read-only accesses vastly outweighs the number of update accesses, which gives an opportunity for improvement

The improvement is to hold the data in memory as objects, and use the database as a write-through cache
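The idea can be sketched in JavaScript, though SAFE itself did this in Java. Everything here is an assumption for illustration: makeFakeDb stands in for the SQL database and only records writes, and makeStore is a made-up name for the in-memory layer.

```javascript
"use strict";

// Hypothetical stand-in for the SQL database: it only records writes.
function makeFakeDb() {
    var writes = [];
    return {
        writes: writes,
        save: async function (id, record) {
            writes.push({ id: id, record: record });
        }
    };
}

// Objects are held in memory. Reads never touch the database, while
// every update is written through to the database as well.
function makeStore(db) {
    var objects = new Map();
    return {
        get: function (id) { return objects.get(id); },
        put: async function (id, record) {
            await db.save(id, record);  // database first, so memory never runs ahead
            objects.set(id, record);
        }
    };
}

var db = makeFakeDb();
var store = makeStore(db);
var finished = store.put(42, { id: 42, breed: "dog" }).then(function () {
    console.log(store.get(42), db.writes.length);
    return store.get(42);
});
```

A read returns the stored object itself with no query and no copying, which is where the gain comes from when reads vastly outnumber updates.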


OODB

The result is effectively an Object Oriented Database, with these features

- read-only accesses involve no copying of data
- read-only accesses involve no memory turnover
- ACID transactions are supported in memory
- indexes are implemented in memory

We did a survey, and found that there were no suitable OODB systems (either they involved copying, or they left transactions up to the programmer)


DIY

I designed an in-memory database (called Adapt, now defunct)

Since SAFE was set up, JPA has come along, which has the right properties

But if you look into it, it is hideously complex (more than Adapt, which is already too complex)

After a lot of thought and experiment, my conclusion is that Java's flaws prevent a simple approach (and no other OO language is better)


Threads

Although we went for a single server, we still wanted threads

JavaScript's asynchronous approach effectively provides pseudo-threads, but real threads are needed in Java - SAFE has 200

The original SAFE server was an 8x8 processor (8 cores, 8 threads each), which speeds up the server by a factor of between 8 and 64

There are no problems because Java, Derby and Adapt are all designed to handle many threads


JavaScript

But what if you want to go multi-threaded in the JavaScript world?

Node is single-threaded, though it does an excellent job of handling simultaneous queries with a single thread (effectively providing pseudo-threads)

But it can't make use of multiple cores, it only ever uses one of them, so how can you speed things up?


Node

The standard way to go multi-threaded with Node is to run several copies of your server, one on each processor core (technically, these are separate processes)

You need a load-balancer to direct each incoming request to one of the copies

You can use a ready-made one such as nginx, but you can write a simple one yourself in five lines of code
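A do-it-yourself load balancer might look roughly like this. It is a sketch rather than a production design, and somewhat longer than five lines because it is spelled out with comments and an error case; the port numbers and the names pickPort and forward are assumptions, and the server copies are assumed to be listening on the same machine.

```javascript
"use strict";
var http = require("http");

// Hypothetical ports on which the server copies are listening.
var ports = [8081, 8082, 8083, 8084];
var next = 0;

// Round robin: hand out the ports in turn.
function pickPort() {
    var port = ports[next];
    next = (next + 1) % ports.length;
    return port;
}

// Forward one incoming request to the chosen copy and relay the reply.
function forward(request, response) {
    var options = {
        port: pickPort(), path: request.url,
        method: request.method, headers: request.headers
    };
    var proxy = http.request(options, function (reply) {
        response.writeHead(reply.statusCode, reply.headers);
        reply.pipe(response);
    });
    // If the chosen copy is down, report a gateway error.
    proxy.on("error", function () { response.writeHead(502); response.end(); });
    request.pipe(proxy);
}

// To run the balancer: http.createServer(forward).listen(8080);
```

The last line is left as a comment so that loading the file does not start a server; uncomment it once the copies are running.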


Sessions

It is important to keep the communication between server copies to a minimum

One issue is sessions; a session is a way of remembering a login, plus some details of the logged-in user

A session can be treated as volatile, i.e. stored only in memory while the server is running, and discarded when the server goes down


Local Sessions

Sharing sessions between server copies is a severe complication

You can avoid it by arranging for the load balancer to send requests from the same user to the same copy of the server - the only one which holds that user's session data

What's needed is for the load balancer to hold a map from session ids to server copy ids (so now the load balancer will be ten lines of code instead of five!)
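The map itself is small. In this sketch the number of copies, the session ids and the function name copyFor are all hypothetical, and how the session id is read from the request (a cookie, say) is left out.

```javascript
"use strict";

// Hypothetical number of server copies, identified as 0, 1, 2, 3.
var copies = 4;
var next = 0;

// Map from session id to the copy which holds that session's data.
var owner = new Map();

// A new session is given the next copy in turn; a known session is
// always sent back to the copy it was first given.
function copyFor(sessionId) {
    if (!owner.has(sessionId)) {
        owner.set(sessionId, next);
        next = (next + 1) % copies;
    }
    return owner.get(sessionId);
}

console.log(copyFor("alice"), copyFor("bob"), copyFor("alice"));
// 0 1 0
```

In a real balancer the entry would also need removing when the session ends, otherwise the map grows without limit.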


Sharing the database

The only issue left is for the server copies to share access to the database

Anything that the servers need to share can be put in the database

The simplest option is to switch to a served database, e.g. from sqlite3 to postgresql, say

This isn't a big deal, especially compared to the overall amount of work needed to design and build a scaled-up system


Data APIs

There is another option which is worth considering

Lots of sites provide a data API (Google Maps, social networking sites, ...)

You send a request asking for data, and you get back the data you asked for

Each uses a customized protocol, based on the design of the data which the site holds

Each protocol is extremely simple, nothing like the general querying protocol provided by a served database


Custom database

Following the API model, what you can do is to run one database process alongside the server processes

It isn't a served database, but instead an embedded database with code which you write to provide a data API for your data

So the database may as well be sqlite3

The server copies each use the data API to interact with the database process

This is probably as simple and efficient as you can get
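As a rough illustration of how small such a data API can be, here is a sketch of a database process offering a single request type. The URL scheme /animal/<id>, the port and the function names are invented for the example, and a Map stands in for the embedded sqlite database so that the sketch runs on its own.

```javascript
"use strict";
var http = require("http");

// Stand-in for the embedded database; in a real system each lookup
// would be a prepared select statement on sqlite.
var animals = new Map([
    [42, { id: 42, breed: "dog" }],
    [53, { id: 53, breed: "fish" }]
]);

// The whole protocol: GET /animal/<id> returns that row, or null.
function lookup(url) {
    var match = /^\/animal\/(\d+)$/.exec(url);
    if (!match) return null;
    return animals.get(Number(match[1])) || null;
}

// Deliver the row as JSON, or a 404 if there is no such row.
function handle(request, response) {
    var row = lookup(request.url);
    response.writeHead(row ? 200 : 404, { "Content-Type": "application/json" });
    response.end(JSON.stringify(row));
}

// To run the database process: http.createServer(handle).listen(8090);
```

Each server copy would then fetch, for example, http://localhost:8090/animal/42 instead of running a query itself.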
